Students Library of Education


An Introduction to Educational Measurement


Douglas Pidgeon and
Alfred Yates


AN INTRODUCTION TO
EDUCATIONAL MEASUREMENT


At a time when the current public
examination systems are coming under
close scrutiny this volume discusses
completely and fully the merits and
demerits of various methods of edu- 
cational measurement, starting with 
the theories from which they evolved 
and going on to observe their efficacy 
in practice. School-based systems of 
measurement are also studied, both as 
separate entities and in comparison 
with the public examinations at 11+, 
CSE and ‘O’ and ‘A’-levels of the 
GCE. 

The subject-matter of this book must 
affect students, practising teachers and 
educationists and, whatever their 
views, they will be able to use the 
authors’ exposition of this controver- 
sial subject as a starting point for 
further and deeper discussion. 


16s 0d net


An Introduction to Educational Measurement 


THE STUDENTS LIBRARY OF EDUCATION 


GENERAL EDITOR: 
Emeritus Professor J. W. Tibble 
University of Leicester 


EDITORIAL BOARD: 
Psychology of Education: 
Professor Ben Morris 
University of Bristol 


Philosophy of Education: 
Professor Richard S. Peters
Institute of Education 
University of London 


History of Education: 
Professor Brian Simon 
University of Leicester 


Sociology of Education: 
Professor William Taylor 
Institute of Education 
University of Bristol 


[...] studies, methodology and other more general topics
are the responsibility of the General Editor, Professor Tibble.


An Introduction to Educational 
Measurement 


by Douglas Pidgeon 
Deputy Director, National Foundation for 
Educational Research 


and Alfred Yates 


Department of Education, 
University of Oxford 


LONDON 
ROUTLEDGE & KEGAN PAUL 
NEW YORK: HUMANITIES PRESS 


First published 1968 
by Routledge & Kegan Paul Ltd 


Broadway House, 68-74 Carter Lane 
London, E.C.4 


Printed in Great Britain 
by Northumberland Press Limited 
Gateshead 


© Douglas Pidgeon and Alfred Yates 1968 


No part of this book may be reproduced 
in any form without permission from 
the publisher, except for the quotation 
of brief passages in criticism 


SBN 7100 6247 8(C) 
SBN 7100 6246 X(P) 
Owing to production delays this book was published in 1969




General editor’s introduction 


There is no more contentious subject in education than that 
of examinations. As is common in such controversial mat- 
ters the light shed on the topic is inversely proportional to 
the heat generated by it. What is often completely obscured 
is that a great deal of research has been done in this field, 
that conventional examinations are only one form of mea- 
surement in education, and that innovations offering im- 
portant alternatives to common practice are now being 
given a great deal of attention. Even more significant, and
seldom realised, is that the argument is mainly about the
functions that examinations and tests are called upon to
fulfil, rather than their technical worthiness as measuring
instruments.

Whatever a teacher’s personal views about the role of 
measurement in education and the place of examinations, 
the fact remains that few escape some involvement in these 
matters. In the interests of educational justice it is the plain 
duty of all teachers to have some grasp of the principles 
involved in educational measurement and some awareness 
of the commoner pitfalls which beset those who construct 
even the simplest tests or even attempt to ‘mark’ essays.

From their long experience in this field and of working 
with teachers in the design and conduct of educational 
experiments, Douglas Pidgeon and Alfred Yates have 
attempted to make plain the essentials of educational
measurement in lucid prose and with the help of straightforward
numerical examples. Students in training as well as
experienced teachers will find here an explanation of all
the fundamental ideas they need to know about as well as 
wise observations about the purposes and limitations of 


various techniques. 
BEN MORRIS 


Contents 


General editor's introduction 


External examinations 
Examinations—some criticisms 
Examinations—some advantages 
Teachers’ attitudes to examinations 


The nature of measurement 
Measurement in schools 
Levels of measurement 


Designing an examination 

Kinds of measurement 

The design of school examinations 
Choice of questions 


Dealing with marks and scores 
The distribution of scores 
Central tendency 

The calculation of the mean 
Spread or variability 


The calculation of the standard deviation 


Some statistical concepts 
Correlation 

Sampling and errors of measurement 
Statistical significance 


The efficiency of measurement 
Item analysis 

Reliability 

Validity 


Expressing the results 
Limitations of percentages 



Mental ages 72
Percentile ranks 75 
Standard scores 76 
Varieties of measurement 83 
What to measure 83 
The structure of abilities 86 
Attainments and aptitudes 89 
Interests and attitudes 93 
The problem of moderation 97 
School-based examinations 97 
The process of moderation 100 
Moderation without a scaling instrument 102 
A simplified procedure 104 
Some current experiments 107 
Conclusion 109 
School-based examinations in secondary schools 109 
The eleven-plus examination 110 
Guidance and counselling 110 
Changes in methods of teaching 111
Curriculum evaluation 111
Research 112 
Some necessary precautions 113 
Suggestions for further reading 115 


1 


External examinations 


Examinations—some criticisms 


Educational measurement is not accorded a generous 
measure of public esteem, largely because it tends to 
be identified with the complex system of public examina- 
tions which, in this country, regulate educational oppor- 
tunity and vocational choice. Furthermore, these exami- 
nations are perceived as conflicts: on one side parents, 
teachers and children form a defensive alliance; on the 
other, the examiners and the organisations they serve are 
regarded as the common enemy. The defects of this 
arrangement have been widely publicised. Parents see 
their children becoming over-burdened and depressed dur- 
ing the months of preparation for the final encounter; 
summer holidays are ruined by the anxious waiting period 
between the examination and the publication of the 
results; and the cumulative effects are such that a by 
no means negligible proportion of University students 
eventually break down under the strain. In these circum- 
stances it is scarcely surprising that tests and examinations 
tend to fall into disrepute.

Some of the criticisms that are levelled against exami- 
nations are manifestly misplaced. Their appropriate target 
is often not the examination itself but rather the purpose 
that it is designed to serve. The ‘eleven-plus’ examination 
is a case in point. At the height of the controversy that 
surrounded this examination some bitter attacks were 
launched against the tests—in particular, the standardised
tests of verbal reasoning or ‘intelligence’—that were used
by the majority of local education authorities. It has since 
become clear that it was the policy of selection, rather 
than the means chosen to implement it, that aroused 
parental indignation. We are now busily engaged on a 
reorganisation of our system of secondary education which 
will allow primary school leavers to be allocated to 
secondary school courses without being submitted to an 
external examination. The fact that some of the tests that 
are used to form part of the eleven-plus examination 
will probably continue to be administered within schools 
for the purposes of educational guidance seems unlikely 
to cause any public outcry. Indeed, there is considerable 
evidence available to show that neither teachers, parents, 
nor the children themselves find these tests in any way 
objectionable. It was their association with a crucially 
important decision affecting a ten-year-old child’s educa- 
tional and vocational opportunities—and consequently 
his social status—that aroused opposition. 

A second objection that is commonly levelled against 
examinations is that they often fail adequately to serve 
the purposes for which they are employed. For example, 
although G.C.E. ‘A’ level examinations do not attract 
the volume of opposition that was encountered by the 
‘eleven-plus’ procedure, they are nevertheless subjected to 
a certain amount of criticism on these grounds. Whilst 
some of these examinations are accepted by many teachers 
as providing a reasonably adequate appraisal of sixth form 
work, the practice of using the results to effect fine dis- 
criminations among the candidates and, furthermore, to 
determine admission to universities and other institutes 
of higher education on the basis of these distinctions is 
viewed with misgiving. 

That they are sometimes employed as means to an 
undesirable end and that they are sometimes inefficient 
are, of course, not the only objections that are raised 
against examinations. There are many experienced teachers
and educationists who sincerely believe that some examinations
foster undesirable habits in pupils and serve to
hinder the fulfilment of important educational aims.
They point out that ‘cramming’ and teaching are not 
synonymous terms and that it is the former activity that 
examinations encourage by putting a premium on the 
slick reproduction of passively received and often ill- 
digested information. They have a correspondingly limit- 
ing effect on a teacher in that they restrict his freedom 
to design his own syllabus or to treat the subject-matter 
in the manner that he deems appropriate. Examinations 
would clearly deserve to be roundly condemned if they 
demanded this kind of restriction—the kind that is epito- 
mised in the phrase ‘backwash effect’. There are two 
answers that can be made to this charge. The first is 
that teachers are free agents and are not therefore com- 
pelled to react in what may be deemed to be the orthodox 
manner to the requirements of an external examination 
at the end of a particular course. That examinations call 
for cramming is not so much a fact as an interpretation. At 
all levels—from the primary school to the university— 
there are some teachers who ‘stick to the syllabus’ and 
indulge in a narrow form of preparation for the next 
examination, and there is a bolder minority who are 
guided by their conceptions of their pupils’ educational 
needs and who regard the demands of the examiners as a 
secondary consideration. Nor does it follow that the latter 
fare worse in terms of examination successes. This is an 
argument, however, that many teachers find unconvincing. 
Perhaps the second answer to this charge will receive 
broader approval. This is that ‘backwash’ effect is not 
necessarily harmful. The fact that there is an examination 
to be faced at the end of a course will undoubtedly affect 
the activities that teachers and pupils jointly undertake, 
but there seems to be no reason why these encouraged 
activities should not be of an educationally desirable kind. 
If a particular examination appears to encourage fact- 
grubbing and rote memorising, whereas the teachers con- 
cerned have sought to encourage the formation of broad 
concepts and problem-solving skills, they are justified in 
condemning this examination as being ill-conceived and
inappropriate but not in dismissing all examinations as
worthless or harmful. On the contrary, it is arguable that
some external examinations confer benefits on society at
large, on the pupils in schools and on those who teach
them. 


Examinations—some advantages 


The kind of society in which we live demands from its 
members a diversity of specialised functions. This creates 
problems of selection and allocation which are of course 
aggravated by the fact that a higher status is accorded 
to some functions than to others. One might conceive 
of some Utopian situation in which for every vacancy 
in any organisation there was one recognisably ideal can- 
didate. In the less than perfect world with which we have 
to make do there will inevitably be a degree of competition 
involved in the arrangements we make for the division of 
labour. For certain kinds of occupation and for promotion 
within them there are always more claimants than can 
be accommodated. In this kind of situation there would 
seem to be no acceptable alternative to the use of tests 
or examinations of some kind. We do not have to ask, 
for example, what it would be like to try to staff the 
administrative branch of the Civil Service without recourse 
to these devices. We have tried it. The result was patronage,
nepotism and inefficiency. In the ordering of public
life examinations have demonstrably promoted social
justice, and have helped us to make progress towards the
goal of equality of opportunity. 

The competitive situation in the world outside is, of 
course, reflected in the educational system. Within this 
there are choices to be made; one university is considered 
to be more desirable than another; some types of secon- 
dary school attract more would-be entrants than they can 
accept; and, inside the schools and other institutions, the 
allocation of pupils and students to the different courses 
that are available cannot always be made to the complete 
satisfaction of all concerned. In these circumstances selection
rather than guidance becomes inevitable and if justice
is to be done and is to be seen to be done, it is arguable 
that examinations must play a part. It is not perhaps 
generally accepted that external examinations confer 
benefits on the pupils who are entered for them. It must be 
admitted, however, that few individuals, even adults with 
scholarly inclinations, work effectively and with sustained 
application if they feel entirely free from obligation to 
external demands of some kind. Even creative artists and 
writers testify to the value of some form of discipline. 
Many a novel would have remained unfinished if its 
author had not been harried by a publisher and committed 
to a specific deadline. For a pupil or student tests and 
examinations help to determine the way in which time 
is organised: they compel the adoption of systematic 
study-habits which few of them would achieve in the 
absence of periodic threats of exposure. It is unfashion- 
able to discuss scholastic work in these terms, of course. 
It is ideally characterised by joyful spontaneity. For most 
frail mortals, however, an appointment with an examiner 
proves to be a useful additional source of motivation.

A further benefit that examinations provide—and this 
applies even to those rare individuals who are able to 
work steadily and systematically without some incentives 
—is that they compel pupils and students not only to
acquire knowledge and skills but to reproduce their knowledge
and to apply their skills. The old pedagogic slogan
that there is no impression without expression contains
a truth that we all acknowledge. This was obviously recognised
by the small girl who asked: how do I know
what I think until I see what I say? The ill-organised
results of a period of study are often transformed into
a much more meaningful pattern by the attempt to communicate
them to someone who is empowered to assess
our understanding.

It is also useful for a pupil or student to be able to 
obtain from time to time an objective and independent
estimate of his progress and attainments and to be able
to compare himself in these respects with his contemporaries.
The damage to morale or even to mental health
that might result from unfavourable comparisons is often 
stressed by those who object to examinations, but it may 
sometimes be in an individual's best interests to discover his 
true status. In some instances an adverse result may 
bring home to him the need to intensify his efforts or 
to modify his approach. The effectiveness of teachers, 
also, can be improved by the discipline involved in pre- 
paring pupils for an examination, irksome as this require- 
ment may sometimes seem to be. The implications are 
much the same for teachers as they are for pupils. Work 
has to be planned systematically and executed thoroughly. 
And just as pupils need to compare themselves with others 
from time to time, to enable them to secure a realistic 
estimate of their strengths and weaknesses, so teachers 
can benefit from discovering how their colleagues fare in 
carrying out similar tasks. This can be a useful check
on the effectiveness, for example, of one’s methods of
instruction. 


Teachers’ attitudes to examinations 


On the face of it this brief review would seem to indicate 
that the advantages to be derived from our system of 
public, external examinations outweigh its disadvantages 
and that therefore those teachers who adopt a suspicious 
or hostile attitude towards educational measurement are 
being somewhat perverse. After all, we have shown that 
many of the criticisms can be adequately met in that 
they do not apply to examinations per se but rather to their 
misuse, in particular circumstances, or to their technical 
deficiencies, which can obviously be at least partially 
remedied. On the other hand, some of the benefits claimed 
on their behalf would seem to be substantial. Be that as 
it may, there is clearly one major defect in the arrange- 
ments we are discussing. Teachers can scarcely be expected 
to become enthusiastic about educational measurement
if it is regarded as an activity wholly distinct from the
process of teaching. It is clearly the separation of these
two functions which has coloured the attitude of teachers
towards tests and examinations in general. As far as 
public examinations are concerned the teacher works with 
his pupils against the examiner. The latter is perceived 
as a somewhat sinister, anonymous figure who has to 
be satisfied or, perhaps, outwitted. This attitude often 
affects even those tests or examinations administered within
schools and for which teachers are themselves responsible.
Many teachers regard school examinations as a
distasteful interruption of what they conceive to be their 
true function, not only because of the extra labour that 
is involved but also, one suspects, because they dislike 
casting themselves in the role of examiners. Normally, 
they regard themselves as their pupils’ friends. To become 
a temporary examiner is rather like deserting to the other 
side. Many of them assume an uncharacteristically for- 
bidding expression for the occasion and are greatly relieved 
when the wretched business is over and they can return 
to their proper calling. 

There have been unmistakable signs in recent years 
of an increasing awareness of the need to associate the 
functions of teaching and examining more closely. The 
eleven-plus examination, for example, began as an external 
examination administered by local education authorities, 
with teachers cast in their familiar role—that of co-operating
with their pupils in a combined effort to satisfy
the examiners’ demands. The whole operation was conducted
like a military campaign: the enemy's strategic
plans were discovered—by studying the papers set in
previous years—and the troops were either trained to make
a frontal assault or, in some instances, coached to employ 
adroit flanking movements. Then a development took 
place which significantly changed the character of the 
conflict. In the interests of predictive efficiency, authorities
began to introduce into their procedures assessments, furnished
by teachers, of each pupil's suitability for an
academic type of secondary education. The majority of
authorities eventually employed these assessments in one
form or another and, in some areas, they became the 
predominant factor in determining the allocation of children
to secondary schools. Thus many primary school
teachers had to adjust to a wholly new and, at times, 
rather embarrassing situation. After ostensibly working 
against the examiners, so to speak, they found themselves, 
first of all co-operating with them and, later, virtually 
taking over their role. These developments required a 
fundamental change in their attitude towards this aspect 
of educational measurement. 

A comparable innovation has recently been introduced 
into the arrangements for examining pupils at the end of 
their secondary school course. When the examinations 
for the new Certificate of Secondary Education were
first proposed there seemed to be a strong likelihood that 
the restrictive influences attributed to the established forms 
of external examinations—those leading to the G.C.E. for 
example—would be extended to many more schools. It 
was envisaged that the number of pupils and teachers
who would find their activities largely governed by sylla- 
buses and examinations devised by agencies outside the 
schools would be more than doubled. Strenuous efforts are 
being made, however, to persuade teachers themselves to 
undertake the responsibility of examining their own pupils 
for this new award. In various parts of the country experiments
are being carried out to determine the extent to
which this can be satisfactorily organised. 

The possibilities of ‘school-based’ examinations have
been investigated. These are examinations devised, administered,
and assessed by the teachers within a particular
school. The results are checked and, where necessary,
‘moderated’ by their [...]
to ensure parity of standards throughout the area. A
further process of moderation to adjust for any difference
between areas or regions can lead to a valid system of
national awards. This [...]
which could have far-[...]
prediction is that teachers will be encouraged to make a [...]
[...] confer could be preserved and their disadvantages
—notably their adverse ‘backwash’ effect—largely
eliminated.

It is of course abundantly clear that if, as we have sug- 
gested, large numbers of teachers are likely to become 
more closely involved in those activities which were 
previously regarded as the province of external examiners, 
they will need to modify their attitude towards the prob- 
lems involved in constructing tests and examinations for 
application on a large scale. When they were at the 
receiving end, so to speak, they were not obliged to 
concern themselves deeply with these matters. It may well 
be necessary, in the future, for them to acquire, as part 
of their professional training, both a deeper insight into 
the principles and some experience of the techniques of 
educational measurement. 

There are two further cogent reasons why teachers 
should be equipped with this kind of knowledge and 
expertise. The first is that the volume of educational research
that is carried out in this country is growing at a
rapid rate. Much of this research involves broad surveys 
in which tests and other measuring instruments are admini- 
stered to pupils in large samples of schools. Teachers will 
thus find themselves increasingly concerned with research 
of this kind in that they will often be invited to participate 
in it and also because its results may significantly affect 
school organisation and methods of teaching. In these
circumstances, they clearly need to develop some understanding
of the kinds of measurements employed in
educational research so that they can appreciate its objec- 
tives, co-operate effectively in its conduct and submit 
the results that it yields to critical appraisal. Finally, 
teachers need to be acquainted with the established principles
and practices of educational measurement, because,
whether or not they become involved in examining on a
large scale or with educational research, they are inevitably
responsible for carrying out a wide variety of
assessments within their own schools and classrooms as
part of their day-to-day activities. Some of these measurements
can significantly affect the progress and [...]
of the children on whose behalf they are undertaken, and 
they should clearly therefore be as valid and accurate as 
possible. The experience that has been accumulated as a 
result of administering tests and examinations to large 
numbers of pupils for a variety of purposes, together with 
the large amount of research that has been devoted during 
the past few decades to the problems of educational and 
psychological measurement have led to the development 
of methods of assessment which can, with advantage, 
be adapted by teachers to serve their own purposes. It is 
with this possibility—that a closer acquaintance with the 
principles and practice of educational measurement will 
enable a teacher to form a sounder judgment of his pupils’
characteristics and educational needs—that this booklet is
mainly concerned. 




2 


The nature of measurement 


Measurement in schools


We noted in the previous chapter that many teachers are 
not only critical of external examinations but even betray 
a lack of enthusiasm for those kinds of educational mea- 
surement which they may be required to undertake within 
their own schools and classrooms. We suggested that one 
reason for this attitude was that teachers have grown 
accustomed to regarding examinations as obstacles which 
outsiders interpose and their own proper function in this 
respect as that of assisting their pupils to overcome them. 
There are other grounds also on which some teachers 
base their objections to educational measurement. The 
first of these is that measurement inevitably involves 
mathematical concepts and symbols and, for some, this 
provides adequate justification for contracting out of the 
enterprise. This, however, is an unnecessarily pessimistic 
view. Any teacher is capable of making intelligent use of 
techniques of educational measurement and of adequately 
interpreting the results that they yield if he is prepared 
to combine common sense with the application of a few 
simple arithmetical procedures, and to accept certain 
assumptions on trust. We can certainly promise that no 
unreasonable demands will be made on his mathematical 
capabilities during the course of this introduction. 

The second objection is one that cannot be disposed of 
quite so readily. This is based on the widespread belief 
that measurement can be effectively applied only to
the most trivial aspects of human behaviour, whereas
teachers are concerned with qualities and attributes of a 
complex and elusive kind. It is generally accepted that 
a child’s height and weight can be adequately measured. 
There are instruments and scales which everybody under- 
stands and knows how to use. People become a little more 
dubious when it is suggested that intelligence can also 
be measured and incredulous to the point of indignation 
if it is claimed that such characteristics as anxiety, aggres- 
siveness, social adjustment or happiness might be similarly 
treated. In these instances, measurement is regarded as 
irrelevant, if not positively irreverent. 

We believe that this objection can be satisfactorily met 
but, since it is based on a misconception of the fundamen- 
tal nature of measurement, we do not propose to attempt 
an immediate answer but rather to undertake a brief 
examination of the general features of measurement as 
applied to education—which we hope will help to clarify 
its possibilities and limitations and, within which, an
answer to this specific objection will emerge.

Our thesis is that measurement serves to improve the
efficiency of educational guidance. This, in its broadest
sense, involves ensuring that each individual pupil is
continuously accorded the kind of treatment best suited
to his needs and capabilities. To this end teachers are
concerned with a succession of judgments and decisions.
Some of these are of a general kind: they need to determine
the form of organisation that will serve their pupils’
interests—how, and on what basis, for example, they are to 
be assigned to different classes, streams or sets or grouped 
within classes for instructional purposes; they have to 
choose suitable equipment, materials and methods and 
perhaps to revise their choice if they discover that satis- 
factory progress is not being made. They are also called 
upon to make decisions, almost daily, concerning the 
treatment of individual pupils, not only with reference 
to their scholastic progress but also in order to pro- 
mote their mental health and the development of their 
characters. 


These decisions require judgments of various kinds. Just
as we should regard a doctor as being somewhat obscuran- 
tist if he maintained that treatment was his proper 
function and on that account refused to concern himself 
with the use of thermometers and stethoscopes, so we 
might expect a teacher to accept assessment as an integral 
part of his professional responsibilities. A doctor carries 
out measurements to determine the kind of treatment an 
individual patient requires. He repeats these measurements 
or carries out fresh ones during the course of the treat- 
ment to ensure that the patient is responding satisfactorily 
and again at the end to check its outcome. A teacher 
is involved in an analogous situation and needs to make 
assessments for comparable purposes. 

This of course is generally recognised. There can be 
few if any teachers who carry out continuous instruction 
for lengthy periods without pausing to discover the extent 
to which their pupils are acquiring the knowledge and 
skills that are being purveyed. Nor are the other decisions 
that we have been discussing made arbitrarily. Teachers 
have a good deal of opportunity to observe their pupils 
and, as a result, are able to form judgments which in 
many instances are valid and accurate. The point at issue, 
therefore, is not so much whether or not teachers need 
to carry out assessments of some kind—this they regularly 
do—but whether or not some of these assessments would 
be more serviceable if they were based on measurement 
rather than on unsystematic observation and subjective 
impressions. 

As we have admitted, there are occasions when precise 
measurement is unnecessary. If a class photograph is 
being taken and the teacher is asked to arrange for the 
taller children to stand at the back and for the smaller 
ones to sit in front he would scarcely find it necessary 
to send all the children into the gymnasium to have 
accurate measurements made of their height. He would 
be able to carry out a rough but adequate classification 
on the spot. On the other hand, if he were asked to report
on the growth of a particular child over a relatively short
period—to assess, for example, the extent to which he
was benefiting from a particular diet or regimen—he 
would not rely on his unaided judgment. Thus reasonably 
precise measurement is called for if we are concerned with 
relatively fine discrimination. In these circumstances it is 
clearly useful if we can express the attributes under review 
in quantitative terms. 

Measurement is also more serviceable than intuitive 
judgment if we are reporting our assessments to another 
person. Suppose that a child is being transferred from 
the care of teacher A to that of teacher B. A reports ‘I 
have watched this child carefully. He is definitely aggres- 
sive.’ Such a report is clearly of limited value. B may 
conclude that since A has watched the child over a period 
his judgment may be more useful than one that was based 
on a five-minute interview, but he might understandably 
feel dissatisfied with it nevertheless. The report provides 
no indication of what A understands by aggressiveness. 
He might entertain a wholly idiosyncratic notion of what 
this attribute involves. He might be disposed to regard as 
indices of aggressiveness modes of behaviour that the 
majority of us classify in some other way. B is provided 
with no information about the basis on which A’s judg- 
ment was formed and furthermore he is told nothing about 
the degree or extent of the child's aggressive tendencies. 

Now let us suppose that A had furnished a different 
kind of report: ‘My colleagues and I have recently
attempted to measure the aggressiveness of our pupils.
We drew up a list of items of behaviour which seemed
to us to be classifiable
[...]
transferred to your class was 47.’


There may well be much to criticise in this report. One 
could perhaps argue that the signs of aggressiveness that 
are quoted are not the only or even the most important 
manifestations of this attribute. One might question the 
methods of observing and ‘scoring’ used in the assessment. 
It is reasonably clear, however, that the person who re- 
ceived a report of this kind is furnished with relatively 
clear and unambiguous information. 

This example serves to illustrate the essential features of 
measurement and to distinguish it from subjective impres- 
sion and intuitive judgment. A succinct definition of 
measurement has been offered by Stevens (see page 115). 
He characterises the process as that of assigning numerals 
to objects and events according to rules. The second 
of the two reports we have been considering satisfies this 
definition at all points. The first report left the recipient 
completely in the dark as to whether the child in question 
could be regarded as excessively or only moderately 
aggressive. In the second an assigning of numerals has 
taken place and evidence provided, therefore, to suggest 
that he manifests this attribute to a considerable degree. 
Again the first report claims that observations have been 
made but fails to specify the objects and events that have 
been noted, whereas the second provides a list of the kinds 
of behaviour that have been taken into account. Finally the 
second report indicates the rules that have been applied
to regulate the collection of evidence and to determine 
the final ‘score’. 

This example may also serve to counter the argument, 
to which we referred earlier, that measurement can be 
applied only to relatively trivial aspects of human behaviour.
Clearly any attribute can be measured if we can
satisfactorily identify its symptoms, so to speak. If we can 
agree about what are the unmistakable signs of aggressiveness,
anxiety, intelligence or what you will, we can
then make arrangements to observe these signs. To satisfy 
the rest of the definition—that is to draw up clear rules 
which can be followed by others for carrying out these 
observations, and to express our conclusions in quantitative
terms—is often the least difficult part of the
problem. In other words, to assert that, for example,
aggressiveness or intelligence cannot be measured is tanta- 
mount to saying that we do not really know what we 
are talking about when we use these terms. This of course 
is very frequently the case but, paradoxical as it might 
seem, attempts to measure an abstruse quality often help 
towards a fuller understanding of its nature and function. 
There is nothing quite like measurement for clarifying 
the concepts. 


Levels of measurement 


Measurement, according to the definition we have accepted, 
involves assigning numerals to objects and events according 
to rules. It is this process of assigning numerals that we 
now propose to examine a little more closely. Stevens 
distinguishes four levels of measurement or four ways in 
which the process of assigning numerals to objects and 
events can be carried out. It is important to be able 
to recognise these distinctions because the interpretation 
of the results that are derived from any form of measure- 
ment and the kind of operation that can legitimately be 
performed on them is governed by the level of measure- 
ment involved. This association may become clearer as 
we discuss the characteristics of the four levels of measurement,
which have been labelled nominal, ordinal, interval
and ratio scales.


Nominal scales. This is the most primitive scale in the 
hierarchy and involves assigning numerals to objects and 
events simply as a convenient way of labelling and 
identifying them. We can distinguish between buses on 
different routes by calling one 14 and the other 15. We 
can discover the identity of a footballer by looking at 
the number on the back of his jersey. It is obvious 
however that numerals used in this way are not expressions
of size or quantity. A number 15 bus cannot be
assumed to be one degree faster or more comfortable
than a number 14 and a team’s inside-right, labelled
number 8, cannot be assumed to be twice as skilful or
as valuable in the transfer market as the right-half who 
wears the number 4 jersey. Clearly none of the basic 
arithmetical operations can be performed on numerals 
employed in this fashion: it would be meaningless, for 
example, to add them, subtract them, or to compute 
averages. Numerals may of course be employed in this 
way to label classes of objects or individuals—batches 
of goods in a factory or groups of children in a school. 
In this instance the process of counting becomes permis- 
sible. If a teacher assigns numbers to three distinguishable 
groups of children in his class—those staying to lunch 
every day (1); those staying on Mondays and Fridays 
only (2); and those who go home to lunch every day (3)—
he cannot carry out any meaningful calculations involving 
these numerals but he is entitled to do so with the numbers 
of children belonging to each category. If he found that 
there were 10 children in group 1, 5 in group 2 and
15 in group 3, nobody is likely to question his conclusion 
that half his pupils never stay to lunch. 
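
The lunch-group example can be set out in a few lines of Python, offered only as an illustrative sketch. The list `labels` is an invented record of the label attached to each of the thirty children, using the counts given above; the variable names are ours. Counting how often each label occurs is meaningful, whereas the mean of the labels is a figure with no interpretation.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical class list: one nominal label per child, using the counts
# quoted in the text (10 in group 1, 5 in group 2, 15 in group 3).
labels = [1] * 10 + [2] * 5 + [3] * 15

# Counting the members of each category is a legitimate operation.
counts = Counter(labels)
never_stay = Fraction(counts[3], len(labels))

# Arithmetic on the labels themselves runs without complaint but means
# nothing: relabelling the groups 7, 8 and 9 would change this figure
# without changing anything about the children.
meaningless_mean = sum(labels) / len(labels)

print(counts[1], counts[2], counts[3])  # 10 5 15
print(never_stay)                       # 1/2
```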


Ordinal scales. An ordinal scale involves arranging objects 
or individuals in order according to the degree to which 
they display a particular characteristic. Thus, objects may 
be ranged in order of size or pupils in a class ranked in 
terms of the degree of effort they put into their school 
work—the most conscientious and persevering pupils be- 
ing placed high in the order and the feckless and indolent 
occupying the lower positions. Numerals can be assigned
in these circumstances—1 to the child at the top of the 
list, 2 to the next in order and so on—and so one of 
the conditions of measurement can be satisfied. It is obvious 
too that these numerals, unlike those involved in nominal 
scales, indicate quantitative distinctions. We do not label
a centre-forward number 9 and an outside-left number 11 
in order to suggest that the former is a superior player. 
But if one child is placed 9th in an order of merit and
another 11th we are asserting that they differ with respect
to some kind of performance or characteristic. Nevertheless,
although in ordinal scales we use numerals to
indicate differences of quality or degree, we are still not 
justified in applying to them the processes of addition, 
subtraction, etc. This is because the intervals in an ordinal 
scale are indeterminate. We may be satisfied that the child 
who comes second in order of merit is superior to the one 
who comes third but we cannot state by how much. The 
size of the interval may vary—and usually does—at dif- 
ferent points in the scale. Thus the gap between the 
second child and the third may well be greater than that 
between, say, the fifteenth child and the sixteenth. Interval 
sizes may also vary from one scale to another. It is largely 
meaningless therefore to average the ranks obtained from 
two or more different scales since the two scales are 
composed virtually of different units. If we find that an 
object weighs ten pounds and is eight inches long, we are 
scarcely justified in asserting that these two measures 
yield an average of nine something-or-others. Calculating 
average ranks from a series of ordinal scales is just about 
as profitless an exercise.

This does not mean that all that we can do with ordinal 
scales is to sit back and admire them. There are certain 
statistical operations that can legitimately and usefully be 
applied to rank orders. For example we can examine the 
extent to which two such orders correspond in the sense 
that the same individuals tend to occupy high or low 
positions in both and, in fact, as we shall see later, we
can express such relationships in fairly accurate quantita- 
tive terms. 
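
One widely used index of the agreement between two rank orders is Spearman's rank correlation coefficient. Whether this is the measure taken up later in the book is not stated here, so the following should be read as a sketch under that assumption. The function name and the two rank orders are invented for illustration, and the formula used holds only when there are no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two untied rank orders of the same pupils."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rank orders of equal length (at least 2)")
    # Sum of squared differences between each pupil's two ranks.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Invented ranks of five pupils on two different assessments.
effort = [1, 2, 3, 4, 5]
neatness = [2, 1, 3, 5, 4]

rho = spearman_rho(effort, neatness)
print(round(rho, 2))  # 0.8
```

A value of 1 indicates identical orders and a value of -1 indicates exactly reversed orders. Note that only the order of the pupils enters the calculation; nothing is assumed about the size of the gaps between adjacent ranks.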


Interval scales. Interval scales may be described as ordinal
scales within which the intervals have been made equal.
A thermometer provides an example of this kind of
measurement. Throughout the scale the difference between
one degree and the next is constant. This means that we
are free at last to add or subtract. If the temperature
yesterday was 80°F and today’s reaches 90° it is legitimate
and meaningful to subtract one from the other and to
say that it is ten degrees warmer today than it was yesterday.
Unfortunately we are still not allowed to multiply
and divide. We might be tempted to assert, for example, 
that a temperature of 20° is twice as hot as one of 10° 
but we are not strictly entitled to do so because, in both 
the Fahrenheit and Centigrade scales, 0° is not an absolute
zero representing the total absence of heat but has been 
chosen arbitrarily as a convenient point from which to 
start the scale. The reason for this caveat may become 
clearer as the discussion progresses. 
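
A short calculation may help to show why the arbitrary zero matters. The sketch below converts the same temperatures from Centigrade to Fahrenheit using the standard conversion; the particular temperatures and the variable names are chosen purely for illustration. The apparent ratio between two temperatures changes with the scale, whereas the ratio between two differences does not.

```python
def centigrade_to_fahrenheit(c):
    """Standard conversion between the two scales."""
    return c * 9 / 5 + 32

cool_c, warm_c, hot_c = 10.0, 20.0, 30.0
cool_f = centigrade_to_fahrenheit(cool_c)  # 50.0
warm_f = centigrade_to_fahrenheit(warm_c)  # 68.0
hot_f = centigrade_to_fahrenheit(hot_c)    # 86.0

# "Twice as hot" depends entirely on where the scale happens to start.
ratio_c = warm_c / cool_c  # 2.0
ratio_f = warm_f / cool_f  # 1.36

# Comparisons of differences survive the change of scale.
diff_ratio_c = (hot_c - warm_c) / (warm_c - cool_c)  # 1.0
diff_ratio_f = (hot_f - warm_f) / (warm_f - cool_f)  # 1.0

print(ratio_c, round(ratio_f, 2))  # 2.0 1.36
print(diff_ratio_c, diff_ratio_f)  # 1.0 1.0
```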

School examinations might well be mistaken for interval 
scales and are often treated as if they were instances of 
this level of measurement. Examination marks are com- 
monly added and subtracted and averages are computed 
from the marks obtained in different subjects. Are we 
justified, however, in assuming that the intervals between 
every pair of adjacent marks are equal? Consider an 
examination or test consisting of forty short questions, 
one mark being awarded for a correct answer to each.
Clearly we could regard this as an interval scale only if
we were satisfied that throughout the range any given
increase in score corresponded to an equivalent increase 
in the degree of knowledge or skill that was being assessed. 
If, on the other hand, the first ten questions were relatively 
easy and the last ten made much more difficult than the 
rest the effort required by one child to increase his score 
from 7 to 8—by answering one more of the easy questions 
correctly—would obviously not correspond with that re- 
quired of another who increased his score from, say, 36 
to 37. In other words, such a test is strictly an ordinal 
scale, and not an interval one and, therefore, as we have 
seen, we are not permitted to add or subtract the scores 
derived from it. This imposes a serious limitation. We 
often need to perform these and other operations on the 
marks and scores we obtain if they are to serve our 
purposes. In some circumstances it is possible to transform 
an ordinal scale into an interval one so that marks or 
scores can be legitimately combined and compared. We
do not propose, at this stage, to discuss the methods involved
in this transformation. Our immediate purpose is
simply to indicate why it has sometimes to be undertaken.


Ratio scales. A ratio scale, as one might have guessed, 
is an interval scale with an absolute zero. Measurements 
of length and of weight are of this kind. These scales start 
from a true zero—no length and no weight—and so we 
can say, for example, that four inches is twice as long 
as two inches without provoking any deprecatory mur- 
murs from the mathematicians. At this level of measure- 
ment, indeed, we may add, subtract, multiply and divide 
to our heart’s content. 

School examinations, as we have seen, are often treated 
as if they were interval scales—which in fact they rarely 
are. They are sometimes even regarded as ratio scales. To 
qualify for inclusion within this category they would not 
only have to manifest equal intervals all along the scale 
but we should need to be satisfied that the zero mark 
represented a total absence of the quality or attribute that 
was being assessed. A zero mark is of course awarded to 
some luckless candidates in examinations but this can 
scarcely ever be interpreted as indicating a complete lack 
of knowledge or skill. It would usually be possible to devise 
some easier questions which even these candidates could 
have answered. In other words the zero point in an exami- 
nation is comparable to that used in the measurement of 
temperature—that is, it is an arbitrarily selected point 
from which it is convenient to start the scale. 

Perhaps we can illustrate this and at the same time 
demonstrate more clearly why, in interval as distinct 
from ratio scales, it is not permissible to carry out the 
operations of multiplication and division. Suppose that we 
devise a spelling test comprising 30 items, in which child 
A achieves a score of 24 and child B one of 8. On the face 
of it A would seem to be three times as good as B. Now 
suppose that we enlarge the test by adding 8 much easier 
items which both children can answer correctly. Their 
scores now become 32 and 16 and so A would now appear 
to be only twice as good as B. 
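
The arithmetic of the spelling-test example can be checked directly. The figures below are those given in the text; the variable names are ours. Adding the easy items moves the zero point of the scale, so the ratio between the two scores changes while the difference between them does not.

```python
from fractions import Fraction

score_a, score_b = 24, 8
extra_easy_items = 8  # answered correctly by both children

ratio_before = Fraction(score_a, score_b)
ratio_after = Fraction(score_a + extra_easy_items, score_b + extra_easy_items)

difference_before = score_a - score_b
difference_after = (score_a + extra_easy_items) - (score_b + extra_easy_items)

print(ratio_before, ratio_after)            # 3 2
print(difference_before, difference_after)  # 16 16
```

That the difference is unchanged shows only that it does not depend on the zero point. Whether a difference of sixteen marks is itself meaningful still depends on the equality of the intervals, as discussed above.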


It is of course theoretically possible to devise educational
tests with an absolute zero. A test of French vocabulary,
involving the translation of French words into their
English equivalents, would yield a scale on which a zero 
mark indicated a complete ignorance of the French 
language—a state of mind which can readily be envisaged. 
Even so, the practical difficulties involved in producing 
a true ratio scale in such circumstances would be consider- 
able. In educational measurement we must be content, for 
the most part, with interval scales and, in many circum- 
stances, we shall find that ordinal scales are the best that 
we can achieve. 

What is of crucial importance is to recognise the kind 
of scale that is in fact being employed and thus to avoid 
submitting the results to inappropriate kinds of treatment 
and interpretation. 
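
The distinctions drawn in this chapter can be summarised as a small lookup table. This is a sketch only: the wording of the operations is ours, and the assumption that each level inherits what is permitted at the levels below it is our reading of the description of the four scales as a hierarchy.

```python
# Levels in Stevens' order, from the most primitive upwards.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that first become legitimate at each level, as described in the text.
NEW_OPERATIONS = {
    "nominal": {"count category members"},
    "ordinal": {"compare ranks", "correlate rank orders"},
    "interval": {"add", "subtract"},
    "ratio": {"multiply", "divide"},
}

def permitted_operations(level):
    """Return every operation allowed at `level`, including inherited ones."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    allowed = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        allowed |= NEW_OPERATIONS[name]
    return allowed

def is_permitted(level, operation):
    """True if `operation` may legitimately be applied at `level`."""
    return operation in permitted_operations(level)

print(is_permitted("ordinal", "add"))      # False
print(is_permitted("interval", "divide"))  # False
print(is_permitted("ratio", "divide"))     # True
```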




3 


Designing an examination 


Kinds of measurement 


In the previous chapter we discussed the end-product of 
measurement. It may seem rather like putting the cart 
before the horse to consider the ways in which we propose 
to deal with the results before addressing our attention 
to the methods of testing or examining that we intend 
to adopt. This is nevertheless a defensible procedure. We 
should think very little of an architect who concentrated 
on the choice of materials and techniques and then ex- 
pressed his surprise at the shape of the building that 
emerged. By the same token, it is inadvisable to carry out 
any kind of educational measurement without envisaging 
the outcome and satisfying ourselves that it will serve our 
purposes. 

We have referred to some of these purposes that educa- 
tional measurement may be required to serve: determin- 
ing the extent to which pupils have benefited from a 
course of instruction; satisfying ourselves that the methods 
of teaching we have adopted are effective; diagnosing our 
pupils’ strengths and weaknesses; predicting their future 
performance. Clearly the kind of measurement we under- 
take must be determined in part by the selected object of 
the exercise. The kind of test or examination that will 
successfully predict future performance, for example, does 
not necessarily provide the most serviceable means of dis- 
covering which aspects of a subject a particular individual 
finds most difficult. 


We shall need to return later to a more detailed consideration
of the difficulties involved in defining the objectives
of measurement. In the meantime let us consider the
choices that confront us when, having determined the 
ends we are setting out to achieve, we seek to select an 
appropriate means. We tend to conceive of educational 
measurement primarily in terms of written tests and 
examinations. For certain purposes, however, there are
other approaches that may be more advantageous. One 
could devise a written examination, for example, to select 
a new fast bowler for a cricket XI. The applicants might 
be invited to indicate, with accompanying diagrams, 
where they would pitch the ball, how they would set the 
field when bowling to a left-handed batsman, the moments 
they would choose to vary length and pace, and so on. 
It is of course conceivable that top marks in such an exami- 
nation might be awarded to a middle-aged, short-sighted 
man with one leg. It would obviously be better to invite 
the applicants to appear at the nets or to take part in 
a practice match before making a choice. Some kinds of 
educational measurement—assessing competence in art and 
crafts or in setting up experiments in a laboratory for 
example—clearly call for tests of a practical kind. We 
commonly distinguish therefore between ‘paper-and- 
pencil’ and ‘situational’ tests. The former may constitute 
the direct measurement of skill—the assessment of the 
neatness and legibility of an individual's handwriting for 
instance—or it may provide evidence which enables us 
to draw inferences about his capabilities in other contexts:
a written examination in Latin and Greek may be used,
for example, to decide whether or not a person is fit to
become an income tax inspector. Similarly, situational 
tests may be real or contrived. An employer may take on 
an applicant for a trial period and observe the way in
which he tackles the job that he will be called upon to
perform or, alternatively, one might copy or simulate a
situation and judge a person’s capacity to deal with the
real thing, so to speak, on the basis of his performance in
a mock trial. 


We also tend to conceive of educational measurement in
terms of assessments carried out by a teacher or examiner.
In some circumstances however we may need to rely on 
evidence that is collected and reported by the pupils them- 
selves. This is particularly necessary when we wish to 
investigate certain personality traits. It would be difficult 
to devise a means of observing the extent to which a child 
lies awake at night worrying about his school work or 
spends his spare time indulging in fantasies that help 
to compensate for some kind of frustration. Even his overt 
behaviour outside the school cannot be conveniently ob- 
served. A teacher who wishes to discover something about 
his pupils’ interests would doubtless like to have detailed 
information about their extra-mural activities: how often 
do they visit museums, libraries, art galleries or how 
much time do they spend making model aeroplanes, help- 
ing in the garden, or watching television? There is clearly 
a good deal of relevant evidence about children’s behaviour 
and characteristics which lies outside the range of the 
kinds of observation which a teacher can himself under-
take. Such evidence can only be obtained if the children 
themselves can be persuaded to supply it. 

These then are the broad choices: between ‘paper-and- 
pencil’ and ‘situational’ tests; and between tests and obser- 
vations carried out by the teachers on the one hand, and the 
reports and self-ratings which the pupils themselves pro- 
vide. It is of course ‘paper-and-pencil’ tests or examinations, 
and, in particular, those designed to measure achievement, 
with which teachers are mainly concerned and we turn
now to a discussion of the principles involved in their
design and construction.


The design of school examinations 


The situation we are envisaging is that of a teacher design- 
ing an examination to be set at the end of a course which 
he himself has conducted. His primary purpose, we shall 
assume, is to assess the extent to which his pupils have 
responded to the instruction he has provided, although,
if he is endowed with humility, he may be disposed to interpret
the exercise as providing, also, an appraisal of his
syllabus and methods. 

The golden rule—more honoured in the breach than 
the observance—that should be applied to the planning 
of all such examinations is that they should mirror exactly 
the objectives of the course to which they are related. All 
too often teachers, being human for the most part, tend 
to improvise their examinations rather hastily just before 
they are due to be administered. There is a double risk 
involved in this procedure. The first is that the examination 
may well exhibit the faults that are so commonly attributed 
to external examinations in that it may fail adequately 
to appraise what has been taught and learned. The second 
is that, in such circumstances, the course itself may be less 
satisfactorily planned and conducted. The examination 
should be devised, or at least envisaged in broad outline, 
at the beginning rather than at the end. The advantages 
of such a reversal of the customary order of events are 
obvious. If a teacher knows, before he starts to teach, the 
kind of questions he proposes to put to his pupils at the end 
of the term or year, he will be aware, throughout the 
proceedings, of the priorities that he must seek to establish. 
He cannot afford, for example, to allow his pupils to treat 
a topic—e.g. the French Revolution—as a sequence of unrelated
events if they are to be questioned in the examination
on the pattern of cause and effect that they exhibit.

In planning his examination, therefore, a teacher is re- 
quired to identify his aims. The ‘aims of education’ is 
of course a phrase that is good for at least a wry smile in 
most teaching circles. Its use tends to be a prelude to a 
parade of tired clichés about beauty, truth and goodness.
The kind of abstract general statements that are cheerfully 
bandied about on speech-day platforms are admittedly 
of little use to a teacher when he gets down to planning 
next term’s syllabus for 4B. He is concerned with detailed 
tactics rather than the broad strategy that governs the total 
enterprise. His aims, therefore, need to be formulated in 
terms of the detailed behavioural changes that he hopes 
to bring about during the course of the next term or year—
and the examinations he administers must be sensitive to 
these changes. The first question that he needs to put is: 
what will my pupils know and be able to do at the end 
of their course that at present they do not know or cannot 
do? The answer to this will serve as a definition of his 
objectives. The second question is: in what circumstances 
can this knowledge and skill best be demonstrated? The 
answer to this will indicate the kind of examination that 
will be required to test the effectiveness of his course. 
To answer these questions satisfactorily calls for a 
much closer analysis of the subject-matter involved than 
is necessary when one sets out to draw up a conventional
syllabus. The latter is an indication of the ground that is 
to be covered within a specified period. The process we are 
discussing also takes into account the effect that the sylla- 
bus is likely to have on the pupils who encounter it. In 


that is applicable to every set of circumstances, to every 
subject, for example, or to pupils of every age and level of 
ability—and no adequate substitute therefore for hard


subject-matter could b 
tions: physical features; climate; 


ties of his pupils. 


No doubt there will be certain specific facts or items of
knowledge that he will require his pupils to memorise.
He will probably also point out to them or arrange for 
them to discover that some of the phenomena they en- 
counter may be classified in terms of some distinctive attri- 
bute—in other words his objectives will include the 
formation or attainment of certain concepts. 

Further, he may wish to demonstrate that among some 
of these concepts there is a relationship which may be ex- 
pressed in general terms as a principle or law. 

Finally, he may hope to secure in his pupils a sufficiently 
firm grasp of the concepts and principles to which they 
have been introduced to enable them to apply their know- 
ledge to a number of, for them, new problems. 

This kind of plan or blue-print can most conveniently be 
represented in the form of a two-way table which can 
serve as a guide both to devising the examination and to 
planning the course. Down one side of the table should be 
listed the broad topics or areas of subject-matter with 
which the course is concerned, and across the top the 
major kinds of operation that the pupils will be expected 
to perform on the material with which they deal. For 
example, the objectives that we have outlined above could 
be set out in Fig. 1: 


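Subject matter       | Memorising     | Formation   | Understanding  | Applying
(rows) and behaviour | specific facts | of concepts | relationships, | knowledge to
(columns)            |                |             | principles,    | new problems
                     |                |             | laws           |
---------------------+----------------+-------------+----------------+-------------
Physical features    |                |             |                |
Climate              |                |             |                |
Communications       |                |             |                |
Agriculture &        |                |             |                |
industry             |                |             |                |

Fig. 1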


Each of the sixteen cells in the table, therefore, represents 
a distinctive feature of the course and we must at least
consider the possibility of devising examination questions
on items appropriate to each. Indeed for practical purposes 
the table should be much longer than the one indicated 
here. Each cell should provide sufficient space for the insertion
of notes offering suggestions for writing items
suitable for testing the objective with which it is concerned. 
We must, of course, ask ourselves first of all whether or not 
we have devoted—or propose to devote—equal amounts 
of time and effort to all the aspects of the course that 
are shown in the table. This is unlikely. The emphasis is 
probably unevenly distributed—both among the content- 
areas and almost certainly among the kinds of intellectual 
activity that we have tried to encourage. We should start 
therefore by inserting in each cell percentage figures to 
show how each should be weighted. The percentages as- 
signed to objectives and to content-areas may be multiplied 
together to determine the percentage of items required 
for each cell. Thus a cell corresponding to an objective 
with 30% weight and a content area with 20% weight 
should carry 6% of the items in the examination. In a 
beginner's course, it might be justifiable to have the 
first column—the memorisation of specific facts—much 
more heavily weighted than the subsequent ones which 
call for higher-order intellectual processes. These latter can- 
not be expected to become prominent until the pupils 
have some knowledge to work on, so to speak. It is this 
column, of course, that is the simplest to deal with in terms 
of devising examination questions. If, however, we have 
encouraged our pupils to think about the knowledge they 
have acquired and to make use of it in solving problems, 
we must make entries in the remaining columns if the examination
is to provide a valid assessment.
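
The weighting procedure described above can be sketched as follows. Only the 30% and 20% weights, with their product of 6%, come from the text; every other weight below is invented for illustration, as are the names `cell_percentages` and `blueprint`.

```python
from fractions import Fraction

# Hypothetical percentage weights. Only the 30% x 20% = 6% case is taken
# from the text; every other figure is invented for illustration.
OBJECTIVE_WEIGHTS = {
    "memorising specific facts": 40,
    "formation of concepts": 30,
    "understanding relationships, principles, laws": 20,
    "applying knowledge to new problems": 10,
}
CONTENT_WEIGHTS = {
    "physical features": 20,
    "climate": 30,
    "communications": 20,
    "agriculture and industry": 30,
}

def cell_percentages(content_weights, objective_weights):
    """Percentage of items for each (content area, objective) cell."""
    if sum(content_weights.values()) != 100 or sum(objective_weights.values()) != 100:
        raise ValueError("each set of weights must total 100 per cent")
    return {
        (area, objective): Fraction(area_weight * objective_weight, 100)
        for area, area_weight in content_weights.items()
        for objective, objective_weight in objective_weights.items()
    }

blueprint = cell_percentages(CONTENT_WEIGHTS, OBJECTIVE_WEIGHTS)
example_cell = blueprint[("physical features", "formation of concepts")]

print(example_cell)             # 6
print(sum(blueprint.values()))  # 100
```

Because each set of weights totals 100 per cent, the sixteen cell percentages necessarily total 100 per cent as well, which provides a simple check on the table.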

Let us consider one of the rows—the one concerned with 
physical features—and assume that in this instance we
have tried to encourage all four levels of intellectual activity.
Items appropriate for the first cell in this row would
require the pupils to locate and to label the rivers, moun- 
tain ranges, and other physical features in the region they 
have studied. Under the second heading we should include
items which would reveal the extent to which pupils
understand the properties and characteristics of rivers and
mountains in general. For example, given land-forms of 
different heights above sea-level—100 feet, 1,200 feet, 2,500
feet, 5,000 feet—which of these could be classified as 
mountains? The items in the third cell would be designed 
to reveal the extent to which the pupils have grasped the 
relationships obtaining amongst the objects and events 
they have encountered. We would probe their understand- 
ing, for example, of the factors that determine the pat- 
tern of drainage in the area. In the last column, since we 
are concerned with the application of knowledge to new 
problems, we must of course pose new problems. We 
could for example furnish a map of an unfamiliar region, 
providing incomplete information about its physical fea- 
tures and inviting the pupils to predict the rest—for 
example, part of a river’s course might be indicated and 
the pupils asked to indicate, from an inspection of the 
contours, the direction in which it would continue. We 
might also seize the opportunity, under this heading, of 
forging links between the different content-areas. Given a 
map indicating the physical features, climate, communications
and resources of an unfamiliar area, we
could ask questions about the probable location of industrial
centres.


Choice of questions 


This initial preparation enables us to determine the pro- 
portion of the total examination and the general questions 
that are to be assigned to each cell in the table. We must 
now decide how many questions or items will be required 
to satisfy these conditions. This decision will be affected by 
a number of factors. The length of an examination must 
obviously depend, in part, upon the age and ability of the 
pupils involved, and the number of separate questions that
it contains may vary according to whether it is designed 
as a ‘speed’ or a ‘power’ test. In most circumstances we 
are mainly concerned to discover the extent of our pupils’
knowledge and skill and, although, as a matter of administrative
convenience, we may set a time limit to the examination,
we try to ensure that they have ample opportunity
to tackle every question. This is what is meant by a 
‘power’ test. On the other hand, we may be specially con- 
cerned to measure the speed as well as the accuracy with 
which children can perform a particular task. In this case 
we could include rather more items than the majority 
could deal with in the time allowed. The crucial factor that 
determines the number of questions in an examination, 
however, is the decision we take concerning the kind of 
item we intend to employ and, in particular, whether we 
propose to use questions calling for very short answers— 
‘objective’ items—or those which require a lengthy con- 
nected discourse—‘essay-type’ questions. If we employ the 
former exclusively the examination might consist of fifty 
to sixty questions or more; whereas about eight of the 
latter would probably constitute the maximum that could 
be envisaged. Thus, if we have decided to devote 6% of 
the examination to a particular cell in the table, we might 
assign to it 3 questions in a wholly objective type of ex- 
amination, but perhaps only half a question in a more 
conventional paper. 
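
The conversion from a cell's percentage share to a number of questions is simple proportion, sketched below. The function name is ours, and the totals of fifty and eight questions are the illustrative figures mentioned above.

```python
def questions_for_cell(cell_percentage, total_questions):
    """Number of questions a cell's percentage share amounts to (may be fractional)."""
    return total_questions * cell_percentage / 100

objective_paper = questions_for_cell(6, 50)  # 3.0
essay_paper = questions_for_cell(6, 8)       # 0.48, roughly half a question

print(objective_paper, essay_paper)  # 3.0 0.48
```

A fractional result for the essay paper is a reminder that a single essay question must usually serve several cells of the table at once.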
The choice between objective and essay-type questions
is sometimes discussed as if it involved a total commit- 
ment: one is invited to take sides in a struggle between the 
progressives and the reactionaries. This is a distorted view 
of the situation. Each type of question has its manifest
advantages and limitations but the balance between the two
needs to be considered in relation to the kind of measure- 
ment one proposes to undertake. In other words, instead 
of seeking to determine which approach, in general, is the 
more efficient, it is more profitable to make one’s choice 
in terms of which type of question best serves a specific 
purpose. 
Essay-type examinations have certain well-known de- 
fects. Since only a relatively small number of questions 
can be included it is difficult, by this means, to secure an 
adequate sampling of the content of a course. This may
mean that the pupils are denied the opportunity to reveal
a good deal of the knowledge and skill that they have 
acquired. Furthermore the form in which they are required 
to demonstrate their accomplishments puts a premium 
on verbal facility. Consequently such examinations may 
furnish misleading evidence about the extent to which 
some objectives have been attained: for example, a child 
who has achieved an adequate understanding of a scientific 
principle may nevertheless be incapable of producing a 
piece of connected prose that will satisfactorily indicate 
his mastery; alternatively, another—unless we are on our 
guard—may be able successfully to camouflage his ignor- 
ance. Finally, the assessment of essay-type examinations 
depends to a considerable extent on subjective judgment 
and, for that reason, accuracy and consistency of scoring 
are difficult to achieve. 

The chief merit of this type of examination is that it 
provides the most convenient means of assessing some of 
the ‘higher’ levels of skill and attainment. We noted earlier 
that an examination of this kind puts a premium on verbal 
facility, which we instanced as one of its defects in some 
circumstances. In others, of course, this would constitute 
a virtue. If we wish to assess the ability to marshal evi- 
dence, to sustain an argument, or to communicate effec- 
tively, essay-type questions would seem to be the obvious 
choice. 

Objective tests involve questions which are designed to 
yield one, and only one acceptable answer. These questions 
may take one of several forms, the commonest being ‘open- 
ended’, ‘true-false’ and ‘multiple choice’. The first of 
these invites simple recall—e.g. the largest city in France 
is — ; the second term is self explanatory—e.g. the largest
city in France is Paris, True/False?; and the third invites
the testee to identify the correct response when this is
offered alongside a number of other possibilities—e.g. the
largest city in France is Marseilles/Lyons/Paris/Brussels/
Nancy. 

For each of the shortcomings that we attribute to essay-type
examinations objective tests can offer a remedy. They
permit of a large number of separate questions so that
the content of a course can be more effectively sampled.
The required form of response—writing a word or phrase 
or underlining the correct answer—accords no special 
favours to the verbally fluent. And of course, since the 
correct answer to each question is clearly specified, assess- 
ment is a wholly objective process and can be carried out 
rapidly and accurately. 

The major criticism that is usually levelled against this 
type of test is that it can be appropriately used to assess 
only a limited range of intellectual operations. In this 
country objective tests have been used mainly in primary 
schools and as part of the eleven-plus examination adminis- 
tered at the end of the primary school course. At the 
secondary and subsequent stages conventional essay-type 
examinations tend to be employed for the most part. It is 
widely assumed that this distinction is determined by 
differences in the scope of the two forms of measurement. 
Established practices in the United States indicate that this 
assumption is questionable: objective tests are extensively
used in American high schools and even, to some extent, 
in Universities. Although it may be difficult to extend the 
use of some types of item—those of the ‘true-false’ kind, for 
example—beyond that of assessing the extent to which 
relatively isolated items of information have been acquired, 
there are other forms of objective measurement that can 
be adapted to serve much more ambitious purposes. 
Multiple-choice items, for instance, can be used at the most 
advanced levels. Such an item consists essentially of two 
components—a ‘stem’ and an array of ‘responses’. The 
stem, in the example we quoted earlier, was an incomplete 
statement (the largest city in France is . . .) and the responses
included one satisfactory means of completing it.
Clearly much more complex problems could be posed 
in this form. The stem could provide, for example, the out- 
line of a topic for research and the responses could con- 
sist of a series of research designs, including statistical 
processes for the analysis of the results, from which the
testee was to choose the most appropriate. Indeed there
is theoretically no limit to the level and complexity of the
intellectual processes that can be adequately tested in this 
way. (For evidence to support this claim the reader should 
consult the illustrative items in Bloom, 1964.) There is 
however a limit in practice which is determined by the 
ingenuity and expertise of the test constructor. This is 
why we suggested earlier that essay-type questions provide 
the most convenient means of assessing some of the ‘higher’ 
levels of skill and attainment. Until a teacher has had con- 
siderable experience of the construction and use of ob- 
jective tests he would probably be well-advised to restrict 
their application to the measurement of fairly routine 
skills. 

The major purpose of this comparison is to show 
that it is unnecessary and inadvisable to restrict one’s 
choice either to essay-type or objective questions. On the 
other hand combining them in the same examination is not 
to be recommended. If they are included within a single 
timed examination, pupils are likely to devote different 
proportions of their time to the two groups of questions 
and our assessments of their performance will thus be dis- 
torted. It is clearly preferable to divide the examination 
into two separately timed sections, or better still, perhaps 
to give an objective test and an essay-type examination on 
different occasions. By the same token a choice of questions 
should be avoided. Consistent and objective assessment is 
difficult to achieve if we set ourselves the task of comparing 
the performances of pupils in different tasks. Furthermore 
if we are designing an examination to conform to a blue- 
print in the way we have described, offering a choice of 
questions is likely to thwart our purposes, since some of 
our specified objectives may be left inadequately assessed. 

One final point needs to be made about the choice of 
questions or items. This concerns the level of difficulty at 
which one should aim. Items may range from those that 
all or most of our pupils will be able to deal with success- 
fully to those that confound all but the ablest minority. 
The mixture or balance that we select ought clearly 
to be determined by the purpose for which the examination
is intended. If our object is to select a small number
of individuals so superior to the rest as to deserve promotion
or the award of prizes it would be reasonable to treat
the examination like an obstacle race: it would obviously 
include a large proportion of difficult questions which 
would eliminate the majority from the competition. On the 
other hand if we are seeking to discover the extent to which 
all the pupils in a class have profited from a course of in- 
struction our object should be to enable each child to 
reveal such knowledge and skill as he may have acquired. 
The proportion of questions that all pupils are capable of 
answering satisfactorily should therefore be high in an ex- 
amination of this kind. 

The kind of preparation we have described, if under- 
taken thoroughly, should make the drafting of the ques- 
tions or items themselves a relatively straightforward 
process. If we have formulated our objectives with preci- 
sion, we shall have correspondingly clear ideas about what 
the examination should contain. Some of this precision 
and clarity may be lost of course if the questions are not 
carefully framed. We may know exactly what we set out 
to teach; we may have successfully defined the kinds of 
performance which will reveal the extent to which our 
teaching has been effective. There still remains the task of 
making clear to the pupils precisely what kind of perfor- 
mance is expected of them. 

The difficulty about tendering advice in this respect is 
that, although the problem can be stated in general terms, 
the required solution has to be adapted to suit a wide range 
of differing circumstances. The phrasing suitable for a
problem in mathematics may not be apt if introduced into 
a question in a history examination; instructions that 
would be clear to a fifteen-year-old may not be suitable for 
a child of ten or eleven. In other words drafting good examination
questions is not so much a question of following
a recipe or formula as that of developing a flair. It is rather
like learning to play cricket. Sage advice such as ‘keep your
eye on the ball’ or ‘play forward with bat and pad together’
may be helpful to the beginner, but if he is to achieve competence
he must practise strokes for himself, discovering
which go safely along the ground for four runs and which
result in his being caught at the wicket. 

A general rule to follow when drafting questions for an 
‘essay-type’ examination is to try to envisage the kind of 
answer that the children concerned are likely to provide. 
This in itself will help to insure against vagueness and ambiguity.
A question beginning ‘State three reasons why . . .’
or ‘List the arguments for and against . . .’ leaves little
scope for misunderstanding. On the other hand ‘What do 
you think of Henry VIII?’ would get the varied treat- 
ment it deserved. A further insurance would be to invite 
a colleague to predict the way in which the children are 
likely to react to the questions. Eventually, of course, the 
children themselves will demonstrate to the full any short- 
comings that may remain. 

There are a large number of do’s and don’ts that apply
to the construction of each type of objective test item. 
Multiple-choice items, in particular, require considerable 
care. It is important, for example, if this type of question 
is to be effective, that each of the available responses has 
an equal chance of being considered. Consider the item 
we quoted earlier as an illustration of this type: the largest 
city in France is Marseilles/Lyons/Paris/Brussels/Nancy.
This would probably serve to distinguish between those
who had or had not acquired this item of knowledge. If,
however, we posed it in a different form, e.g. the largest city
in France is Bolton/Manchester/Paris/Liverpool/Wigan—
the fact that children in Lancashire at any rate indicated 
the right answer would not prove very much. The incorrect 
responses must be at least plausible and therefore should 
be similar to the correct one in appearance. 

In the notes for further reading we suggest sources of 
detailed advice on the drafting of questions of all kinds. 
We would again emphasise, however, that it is by paying 
close attention to the outcome—the way in which pupils 
are found to respond—that a teacher can best develop the 
art of examining. 




4 


Dealing with marks and scores 


The distribution of scores 


Once an examination has been set and administered, and 
marks have been awarded and totalled, one might be 
tempted to regard the exercise as complete. There is a 
good deal still to be done, however, if we are to make full 
use of the evidence that has been obtained. The evidence, 
indeed, is not immediately available in a serviceable form. 
There is little enlightenment to be derived from the con- 
templation of sets of marks in their raw state, as it were. 
For example, a teacher might initially find himself confronted
with something like the following table (Fig. II).


         A                      B                      C

33, 12,  6, 45, 27     35, 22, 15, 44,  ?     32, 19, 22, 14, 37
25, 11, 37, 22, 16     31, 24, 49, 34, 37      9, 29, 21,  5, 33
48, 26, 37, 21,  3     26, 46, 33, 29, 42      8, 24, 31,  7, 16
12, 41, 28, 17, 39     36, 45, 19, 41, 37     26, 11, 21,  2, 26
26, 14, 34, 22, 19     31, 44, 27, 43, 34     16, 23, 31,  8, 27
40, 24, 22, 15, 32     10, 45, 37, 37, 32     18, 10, 30, 15, 25
24, 27, 30, 23, 31      ?, 44,  ?,  ?,  ?      8, 33, 22,  7, 26
19, 24, 27, 20, 33     39, 36, 31, 27, 46     12, 23, 18,  4,  ?
14, 27, 20, 29, 23     37, 34, 16, 37, 42     21,  8, 24, 10, 28
31, 22, 16, 36, 27     33, 40, 38, 29, 34      1, 22, 16,  5, 25
 9, 28, 25, 17, 29     37, 41, 37,  4, 39     17,  8, 27, 12,  ?
13, 18, 44, 12, 28     28, 48, 39, 38, 40     20, 44, 23, 36,  6


Fig. II


These may be the results of an examination set to three
groups of pupils or perhaps the marks obtained by the same
pupils in three different subjects.


There are several questions that a teacher would want 
to ask about these results. He would obviously want to 
know something about the general standard obtained: do 
the marks tend to be high, middling or low? do the three 
groups exhibit differences in this respect? He would also 
want to know whether most children achieved this stan- 
dard, or whether a sizeable proportion of them fell short 
of it or exceeded it by a considerable amount. 

It is difficult even to begin to answer these and similar 
relevant questions if the marks are haphazardly arranged. 
If we are dealing with very small batches it may be possible 
to spot trends and tendencies by inspection. If they are 
numerous, however, we cannot hope to deal with them 
effectively until we have introduced some order into their 
arrangement. This operation is described as tabulation. 
One obvious approach would be to set them down in order 
of size. This is easy enough if we are dealing with relatively 
few marks but becomes more laborious as the number 
increases. A more manageable alternative is to draw up a
frequency table. This involves counting how many times
each mark occurs in the set. For example, the marks in set
A above would be set out in a frequency table as in Fig. III.


Score   F      Score   F      Score   F
 48     1       30     1       18     1
 45     1       29     2       17     2
 44     1       28     3       16     2
 41     1       27     5       15     1
 40     1       26     2       14     2
 39     1       25     2       13     1
 37     2       24     3       12     3
 36     1       23     2       11     1
 34     1       22     4        9     1
 33     2       21     1        6     1
 32     1       20     2        3     1
 31     2       19     2

Fig. III
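
For readers who have a computer to hand, the counting involved in a frequency table can be sketched as follows. The list `SET_A` is our transcription of the sixty marks in set A and should be treated as an assumption wherever the printed table is hard to read; `Counter` is part of Python's standard library.

```python
from collections import Counter

# Set A, sixty marks (our transcription of the printed table).
SET_A = [
    33, 12, 6, 45, 27, 25, 11, 37, 22, 16,
    48, 26, 37, 21, 3, 12, 41, 28, 17, 39,
    26, 14, 34, 22, 19, 40, 24, 22, 15, 32,
    24, 27, 30, 23, 31, 19, 24, 27, 20, 33,
    14, 27, 20, 29, 23, 31, 22, 16, 36, 27,
    9, 28, 25, 17, 29, 13, 18, 44, 12, 28,
]

# Count how many times each mark occurs.
frequency = Counter(SET_A)

# Print the table from the highest mark downwards.
for score in sorted(frequency, reverse=True):
    print(f"{score:>2}  {frequency[score]}")
```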


It is sometimes found even more convenient—and just as 
useful in practice—to deal with marks in groups, or classes,
rather than to treat each one separately. An appropriate
frequency table for the marks in set A is shown in Fig. IV.


Score     Tally               Frequency (f)
45-50     ||                        2
40-44     |||                       3
35-39     ||||                      4
30-34     ++++ ||                   7
25-29     ++++ ++++ ||||           14
20-24     ++++ ++++ ||             12
15-19     ++++ |||                  8
10-14     ++++ ||                   7
5-9       ||                        2
0-4       |                         1


Fig. IV


A convenient way of drawing up a frequency table 
is to use tallies or five-barred gates as shown in the middle 
column of this table. This involves placing a stroke beside 
any given mark in the left-hand column each time we encounter
it as we look down the list of marks. Each fifth
stroke is made horizontally, so that the frequencies can be
totalled conveniently at the end. It is already clear that
once a set of marks has been arranged in this fashion we
can begin to see the pattern that it exhibits. We can perceive
the distribution of the marks—whether, for example,
they are scattered fairly evenly over the whole range
or whether they tend to pile up at certain points on the
scale.

[Fig. V: histogram of the grouped marks in set A. Horizontal axis: Score, in classes 0-4 to 45-50; vertical axis: No. of Cases.]
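
The grouping and tallying just described can be sketched in code. This is only an illustration: `SET_A` is our transcription of set A, the helper names are ours, and a completed gate of five is printed as `++++` (four strokes crossed by a horizontal fifth).

```python
# Set A, sixty marks (our transcription of the printed table).
SET_A = [
    33, 12, 6, 45, 27, 25, 11, 37, 22, 16,
    48, 26, 37, 21, 3, 12, 41, 28, 17, 39,
    26, 14, 34, 22, 19, 40, 24, 22, 15, 32,
    24, 27, 30, 23, 31, 19, 24, 27, 20, 33,
    14, 27, 20, 29, 23, 31, 22, 16, 36, 27,
    9, 28, 25, 17, 29, 13, 18, 44, 12, 28,
]

# Class limits as used in the text, highest class first.
CLASSES = [(45, 50), (40, 44), (35, 39), (30, 34), (25, 29),
           (20, 24), (15, 19), (10, 14), (5, 9), (0, 4)]


def grouped_frequencies(marks, classes):
    """Count the marks that fall within each class (limits inclusive)."""
    return {(lo, hi): sum(lo <= m <= hi for m in marks) for lo, hi in classes}


def tally(n):
    """Render a frequency as five-barred gates plus any remaining strokes."""
    gates, rest = divmod(n, 5)
    return " ".join(["++++"] * gates + (["|" * rest] if rest else []))


table = grouped_frequencies(SET_A, CLASSES)
for (lo, hi), f in table.items():
    print(f"{lo}-{hi}".ljust(7), tally(f).ljust(16), f)
```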

The distribution of a set of marks may be observed even 
more clearly if it is represented graphically. The diagram 
(Fig. V), which is called a histogram, sets
out the facts included in the previous frequency table 
in a form which enables us to inspect the distribution 
directly. 


An alternative way of illustrating the facts is by entering 
on a graph the mid-points of each group or class of marks 
in the frequency table. The result is called a frequency 
polygon (Fig. VI). 


[Fig. VI: frequency polygon of the grouped marks in set A. Horizontal axis: Score, at the class mid-points 2, 7, 12, 17, 22, 27, 32, 37, 42, 47; vertical axis: No. of Cases.]



If this kind of graph contained a very large number of 
points, it would eventually become a smooth curve. The 
following (Fig. VII) illustrates the kind of curve that is 
frequently encountered in educational and psychological 
measurements. It is called a ‘normal’ curve and represents
the distribution that is obtained when, for example, a
standardised test of ability or attainment is administered
to a large representative sample of children.

[Fig. VII: a ‘normal’ curve. Horizontal axis: Score; vertical axis: No. of Cases.]

Let us suppose that a test which yielded this kind of
distribution when given to a representative sample—that
is, to one containing in their proper proportions children
of all levels of ability—were applied to children drawn
exclusively from ‘A’ streams. In these circumstances there
would be a heavy preponderance of high marks with a
kind of distribution described as being negatively skewed
(Fig. VIII).

[Fig. VIII: a negatively skewed curve. Horizontal axis: Score; vertical axis: No. of Cases.]


If the test were applied to ‘C’ stream children only it would 
reveal the opposite state of affairs and the curve would 
be positively skewed (Fig. IX). 


[Fig. IX: a positively skewed curve. Horizontal axis: Score; vertical axis: No. of Cases.]


If it were applied to a sample consisting of ‘A’ and ‘C’
stream children but with all ‘B’ or average pupils omitted
we should obtain a double-humped or bi-modal curve
as in Fig. X.


We have provided these examples to illustrate the fact 
that the shape of the distribution exhibited by a set of 
marks supplies us with useful information about the per- 
formance of the pupils concerned and therefore about the 
adequacy or otherwise of the examination. The reader 
should, as a practice exercise, tabulate the marks given in 
sets B and C of Fig. II, and then draw either a histogram
or frequency polygon. These figures will demonstrate to 
him some of the different distributions that have been 
described above. 


[Fig. X: a double-humped (bi-modal) curve. Vertical axis: No. of Cases.]


Central tendency 


When we have tabulated a set of marks we have intro- 
duced some order into the proceedings. What we now require
are distinguishing labels, so to speak. We need some
convenient means of describing the characteristics of each 
set of marks with which we have to deal. In particular, 
we wish to be able to state the ways in which one differs 
from another. One such distinctive feature we have 
already discussed—the distribution of the marks. There 
are two others which are useful for this purpose: a 
measure of central tendency and a measure of spread. 

As we have suggested earlier some of the questions that 
a teacher wants to answer concern the general level of 
the marks he is considering. May they be described as 
high or low? Do they tend to be better or worse than 
those of another group of children or than those obtained 
in a previous examination? To answer these questions he 
needs a measure of the central tendency which may take 
one of three forms each of which has its uses in particular 
circumstances. The commonest is the arithmetic mean— 
which is what we are usually referring to when we use 
the term average. This may be calculated, of course, by 
adding up all the marks and dividing the total by the 
number of pupils. Another measure of central tendency 
is the median. This is the mark obtained by the pupil who 
occupies the mid-point of the distribution or who, in other 
words, is exactly half way down the list if the marks are 
arranged in order of merit. (The median of the marks in
set A (Fig. II) is 24.5.) Finally, we may use the mode
which, as the word suggests, is the most fashionable,
or most frequently occurring score in the distribution. (For
set A it is 27.) The reader may care to discover for himself
both the median and the mode of the marks given in sets
B and C in Fig. II.
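
The three measures can be checked against set A with Python's standard `statistics` module. The list `SET_A` is our transcription of the printed marks and is an assumption to that extent; the values in the comments are those quoted in the text (the exact mean of 24.67 is given later in this chapter).

```python
import statistics

# Set A, sixty marks (our transcription of the printed table).
SET_A = [
    33, 12, 6, 45, 27, 25, 11, 37, 22, 16,
    48, 26, 37, 21, 3, 12, 41, 28, 17, 39,
    26, 14, 34, 22, 19, 40, 24, 22, 15, 32,
    24, 27, 30, 23, 31, 19, 24, 27, 20, 33,
    14, 27, 20, 29, 23, 31, 22, 16, 36, 27,
    9, 28, 25, 17, 29, 13, 18, 44, 12, 28,
]

mean = statistics.mean(SET_A)      # sum of the marks divided by their number
median = statistics.median(SET_A)  # with sixty marks, midway between the 30th and 31st
mode = statistics.mode(SET_A)      # the most frequently occurring mark

print(round(mean, 2))  # 24.67
print(median)          # 24.5
print(mode)            # 27
```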

The mean is the most useful of these measures of cen- 
tral tendency and the one normally employed, but we are 
justified in using it only if the distribution with which 
we are concerned is reasonably symmetrical. Suppose for 
example that we were collecting evidence about the earn- 
ings of teenage youths in a particular neighbourhood and 
that we discovered the weekly wages of a small sample— 
eleven in all. Ten of them we found earned £10 per 
week and the eleventh—having taught himself to play the 
guitar—earned £450 per week. If we reported that the 
average earnings of the group was £50 per week we 
should obviously be supplying misleading information 
about the circumstances of the vast majority. Ten of the
eleven, in fact, earn considerably less than the average, if
this is expressed in terms of the arithmetic mean. The 
median wage for the group would be £10, of course, and 
so would be the mode. Clearly, if we wish to characterise 
this group—to give someone else an accurate impression 
of its standard of living for instance—we should avoid 
the mean which is distorted by the inclusion of one 
unusually wealthy member and use the median or mode 
instead. In normal circumstances, however, the distribu- 
tions with which we deal are reasonably symmetrical and 
the mean is a satisfactory measure to employ. 
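
The wages example can be verified directly. The sketch below uses only the figures given in the text and Python's standard `statistics` module; the variable names are ours.

```python
import statistics

# Ten youths earning 10 pounds a week and one guitarist earning 450 pounds.
wages = [10] * 10 + [450]

mean_wage = statistics.mean(wages)
median_wage = statistics.median(wages)
mode_wage = statistics.mode(wages)

print(mean_wage)    # 50: pulled far above what ten of the eleven earn
print(median_wage)  # 10
print(mode_wage)    # 10
```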


The calculation of the mean 


The formula for calculating the mean is of course a delight- 
fully simple one—the sum total of the marks divided by 
the number of pupils. If, however, we are dealing with 
large numbers, this operation can become very tedious. 
It can be simplified somewhat if we tabulate the marks 
in the form of a frequency distribution. Fig. XI shows the 
marks of set A (see Fig. II) arranged in this way.


Marks     Mid-Point     Frequency
              X             f            fX
45-50        47             2            94
40-44        42             3           126
35-39        37             4           148
30-34        32             7           224
25-29        27            14           378
20-24        22            12           264
15-19        17             8           136
10-14        12             7            84
5-9           7             2            14
0-4           2             1             2
                          N = 60     *ΣfX = 1470

\[ M = \frac{\Sigma fX}{N} = \frac{1470}{60} = 24.5 \]

*The total or sum of a set of numbers is indicated by the Greek
letter Σ.


Fig. XI



The first column gives the marks in groups or classes. The
X column shows the mid-point or average of each of these 
classes. The f column shows the frequency—that is, the 
number of pupils obtaining marks within each of the 
classes—and the fX column shows the product of the 
two previous columns. When marks are set out in this 
way a reasonable approximation to the mean of the dis- 
tribution may be obtained by finding the total of the fX 
column and dividing by N (the number of pupils). Thus, 
as can be seen, the marks in set A yield a mean of

\[ \frac{1470}{60} = 24.5. \]
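
The grouped calculation can be sketched as follows, using the mid-points and frequencies tabulated for set A. The names `GROUPED_A` and `grouped_mean` are ours.

```python
# (mid-point X, frequency f) for each class of set A, highest class first.
GROUPED_A = [(47, 2), (42, 3), (37, 4), (32, 7), (27, 14),
             (22, 12), (17, 8), (12, 7), (7, 2), (2, 1)]


def grouped_mean(table):
    """Approximate the mean as the total of the fX column divided by N."""
    n = sum(f for _, f in table)
    total_fx = sum(x * f for x, f in table)
    return total_fx / n


print(sum(f for _, f in GROUPED_A))      # 60
print(sum(x * f for x, f in GROUPED_A))  # 1470
print(grouped_mean(GROUPED_A))           # 24.5
```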

Even this method can involve a good deal of tiresome 
computation. This can be reduced by electing to involve 
small rather than large numbers in our calculations. This 
can be arranged quite simply. For example, suppose that 
we were confronted with the task of calculating the mean 
height of a large number of individuals. Even if frequency 
tabulation were used, the task would be a formidable one. 
Now suppose that the following five measures were a
typical sample of the quantities with which we had to
deal: 5’ 4½”, 5’ 9”, 5’ 8½”, 5’ 11”, 6’ 0”. Clearly
averaging these would be very much simpler if we subtracted
5’ 4” from each and dealt with the small remainders.
The average of the latter is 5”. If we now
restore the missing 5’ 4” we arrive at the average of 5’ 9”,
having avoided a good deal of unnecessary trouble.

This procedure can be applied to the calculation of 
means. This can be done by using an assumed mean (AM). 
Setting out the marks in the form of a frequency table 
as before, one now guesses where—that is, within which 
class of marks—the mean is likely to fall. It does not 
matter whether or not the guess is correct. The table now 
appears as in Fig. XII. 


A zero is placed, under column x, against the class of
marks within which the mean is assumed to lie and the
remaining classes are numbered +1, +2, -1, -2, etc.
to indicate their distance from the mean.



Mark          Mid-                      Class
Intervals     Point     Frequency       No.
(i=5)           X           f            x          fx
45-50          47           2           +4          +8
40-44          42           3           +3          +9
35-39          37           4           +2          +8
30-34          32           7           +1          +7
25-29          27          14            0           0
20-24          22          12           -1         -12
15-19          17           8           -2         -16
10-14          12           7           -3         -21
5-9             7           2           -4          -8
0-4             2           1           -5          -5
                         N = 60                 -62  +32
                                                Σfx = -30

\[ M = AM + \frac{\Sigma fx \times i}{N} = 27.0 + \frac{-30 \times 5}{60} = 27.0 - 2.5 = 24.5 \]


Fig. XII


Expressed as a formula, the procedure for estimating the
mean by this method is

\[ M = AM + \frac{\Sigma fx \times i}{N} \]

where Σfx is the total of the fx column, i is the class
interval and AM is the assumed mean.

Thus in this instance the mean would again be 24.5.
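
A short sketch may help to show that the choice of assumed mean really does not affect the result. The function and its parameter names are ours; `zero_index` is simply the position, counting from the highest class as 0, of the class in which the mean is guessed to lie.

```python
# Frequencies for set A, from class 45-50 down to class 0-4.
FREQS_A = [2, 3, 4, 7, 14, 12, 8, 7, 2, 1]


def assumed_mean_estimate(freqs, assumed_mean, interval, zero_index):
    """Estimate the mean as AM + (sum of fx) * i / N.

    Classes are numbered +1, +2, ... above the zero class and
    -1, -2, ... below it, as in the worked table.
    """
    n = sum(freqs)
    sum_fx = sum(f * (zero_index - k) for k, f in enumerate(freqs))
    return assumed_mean + sum_fx * interval / n


# Guess that the mean lies in class 25-29 (mid-point 27), as in the text.
print(assumed_mean_estimate(FREQS_A, 27.0, 5, 4))  # 24.5

# A different guess, class 20-24 (mid-point 22), gives the same estimate.
print(assumed_mean_estimate(FREQS_A, 22.0, 5, 5))  # 24.5
```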

The arithmetic mean for this set of marks calculated by 
adding all the marks together and dividing by N is 24.67 
so it will be seen that the short method provides a reason- 
able approximation. For most purposes this kind of 
estimate of the mean is all that is required. The extra 
labour needed for a strictly accurate calculation is not 
normally justifiable. In order that he may become familiar 
with the procedure for calculating a mean from grouped 
data, the reader should work out for himself the means 
of the other two sets of marks, B and C, given in Fig. II.


Spread or variability 


When we are dealing with a set of marks or trying to 
interpret the evidence that it discloses it is important not 
only to secure a measure of central tendency but also to 
discover the spread or variability of the marks. It is possible 
for two examinations to yield similar means but for the 
spread of marks to differ considerably. The distributions 
of the two sets of marks might, for example, be as in 
Fig. XIII. 


[Fig. XIII: two distributions with similar means but different spreads. Vertical axis of each: No. of Cases.]


In such circumstances treating the examinations as comparable,
simply because they have the same mean, would
be misleading. Consider the following example. This shows 
some of the results obtained by the same pupils in two 
different examinations: one, say, in mathematics and the
other in English. We show (Fig. XIV) the marks obtained 
by the pupils at each extreme of the rank order. 


      Mathematics                    English
Rank Order    Mark (%)       Rank Order    Mark (%)
     1           98               1           68
     2           94               2           59
     3           87               3           54
    28           14              28           32
    29            8              29           28
    30            0              30           27


Fig. XIV



It is clear that in the mathematics examination the marks 
cover the whole of the possible range, whereas less than 
half of it has been used in the English examination. The 
spread of the marks is therefore much greater in the 
former case. This means that combining these two sets of 
marks to achieve a grand total or overall rank order would 
obviously lead to some injustice. The mathematics exami- 
nation would contribute much more ‘weight’ to the total. 
The pupil who leads the field in English carries only 68
marks into the final reckoning, whereas another, who 
may be well down the order in mathematics, will never- 
theless have more to contribute. In other words it is more 
advantageous, in terms of the final result, to do well in 
mathematics than to do well in English. And yet it is 
unlikely that it is intended that the result should reflect 
this. When we add marks together we are usually assum- 
ing that they are ‘weighted’ equally. This is only the case 
if each has the same spread. The importance of the spread 
of marks can be illustrated by considering the following 
rather extreme example (Fig. XV). 


                 Percentage Marks

Pupil    Maths.    English    History    Geog.    Total
  A        90        30         40        48       208
  B        70        40         45        49       204
  C        50        50         50        50       200
  D        30        60         55        51       196
  E        10        70         60        52       192

Fig. XV


In Mathematics the marks are rather well spread out 
among the five pupils, with A leading the order of merit, 
B second, etc. In each of the other three subjects the 
order is completely reversed, E being the top scorer in
every case. The marks, however, are bunched much closer
around the mean, particularly so in the case of geography.
The total of all four examinations produces the same order
of merit as that for mathematics, and the fact that E
came top in three of the four has not counted at all. It
will be noticed that the mean score in each examination is
exactly 50. Clearly if we want a particular subject to 
carry a great deal of weight in a total examination score, 
the thing to do is to spread the marks in that subject out 
over the whole range of available scores. 
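
The effect can be reproduced from the five pupils' marks. The sketch below uses the figures of Fig. XV; the variable names are ours.

```python
# Percentage marks of the five pupils in the four subjects.
MARKS = {
    "A": {"Maths": 90, "English": 30, "History": 40, "Geog": 48},
    "B": {"Maths": 70, "English": 40, "History": 45, "Geog": 49},
    "C": {"Maths": 50, "English": 50, "History": 50, "Geog": 50},
    "D": {"Maths": 30, "English": 60, "History": 55, "Geog": 51},
    "E": {"Maths": 10, "English": 70, "History": 60, "Geog": 52},
}

# Grand total for each pupil, and the order of merit it produces.
totals = {pupil: sum(subjects.values()) for pupil, subjects in MARKS.items()}
order_by_total = sorted(totals, key=totals.get, reverse=True)

# Order of merit in mathematics alone, and in English alone.
order_by_maths = sorted(MARKS, key=lambda p: MARKS[p]["Maths"], reverse=True)
order_by_english = sorted(MARKS, key=lambda p: MARKS[p]["English"], reverse=True)

print(totals)            # {'A': 208, 'B': 204, 'C': 200, 'D': 196, 'E': 192}
print(order_by_total)    # ['A', 'B', 'C', 'D', 'E']
print(order_by_maths)    # ['A', 'B', 'C', 'D', 'E']
print(order_by_english)  # ['E', 'D', 'C', 'B', 'A']
```

The total follows the one widely spread subject exactly, even though the other three subjects all rank the pupils in the opposite order.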

There are various ways in which the spread can be 
estimated. The simplest measure would be obtained by
subtracting the lowest from the highest mark. This in- 
dicates the range of the marks in each case. In the first 
of the two examples above the ranges are 98 and 41 
respectively—a comparison which adequately reflects the 
state of affairs. The range may be a misleading indication 
of the spread of a set of marks, however, if one or more 
of the extreme scores are atypical. If the highest mark in 
the mathematics examination, for example, had been 
obtained by one unusually gifted pupil and the mark of the 
second in the order had been something like 50%, the 
range we have quoted would be a distorted representation 
of the way in which the marks were scattered. 

This kind of distortion can be avoided by using the 
inter-quartile range. This involves subtracting the mark 
obtained by the pupil who is three-quarters of the way 
down the order from that of the pupil who is one quarter 
of the way down. By this means any atypical extreme 
scores can be ignored. The best measure of spread, how- 
ever, is provided by the standard deviation which takes 
into account the difference between every mark and the 
mean. The formula for calculating the standard deviation 
is as follows: 

\[ \text{Standard Deviation (S.D. or } \sigma\text{)} = \sqrt{\frac{\text{sum of } d^{2}}{N}} \]


where d is the difference between each mark and the mean 
mark, and N is, of course, the total number of marks or
pupils.
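
As an illustration, the range and the standard deviation can be worked out for the four subjects of Fig. XV, each of which has a mean of 50. The function names are ours; the standard deviation follows the formula just given, dividing by N.

```python
import math

# Marks of pupils A to E in each subject, taken from Fig. XV.
SUBJECTS = {
    "Maths": [90, 70, 50, 30, 10],
    "English": [30, 40, 50, 60, 70],
    "History": [40, 45, 50, 55, 60],
    "Geog": [48, 49, 50, 51, 52],
}


def mark_range(marks):
    """Highest mark minus lowest mark."""
    return max(marks) - min(marks)


def standard_deviation(marks):
    """Square root of the mean squared difference d between each mark and the mean."""
    n = len(marks)
    mean = sum(marks) / n
    return math.sqrt(sum((m - mean) ** 2 for m in marks) / n)


for subject, marks in SUBJECTS.items():
    print(subject, mark_range(marks), round(standard_deviation(marks), 2))
# Maths 80 28.28
# English 40 14.14
# History 20 7.07
# Geog 4 1.41
```

The same means thus conceal spreads that differ by a factor of twenty, which is why mathematics dominates the totals in that table.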


The calculation of the standard deviation 


As we found when we discussed the calculation of means,
the standard formula is simple enough, but may involve


tedious computation if we have to deal with a large array 
of marks. Again we may resort to a shorter method which 
yields a reasonable approximation. All that one requires 
in this instance is to add one extra column to the tabulation 
shown in Fig. XII for the calculation of the mean. This
is the fx? column. The entries in this column are the 
products of those in the f column and the squares of those 
in the x column. Having set out the marks in this way, 
an estimate of the standard deviation is obtained by apply- 
ing the following formula: 


\[ \text{S.D.} = i \sqrt{\frac{\Sigma fx^{2}}{N} - \left(\frac{\Sigma fx}{N}\right)^{2}} \]


where i is again the class interval (5 in our case), N is 
the total number of marks and the other components are 
as before.

We provide a worked example based on the marks in 
set C from page 36 (Fig. XVI). 


Mark        Mid-                Class
Intervals   Point   Frequency   No.
i=5         X       f           x       fx      fx²
40-44       42       1           5       5      25
35-39       37       2           4       8      32
30-34       32       6           3      18      54
25-29       27       9           2      18      36
20-24       22      12           1      12      12
15-19       17       8           0       0       0
10-14       12       6          -1      -6       6
5-9          7      11          -2     -22      44
0-4          2       5          -3     -15      45
                    N=60        Σfx = 61 − 43 = 18      Σfx² = 254

$$\text{S.D.} = 5\sqrt{\frac{254}{60} - \left(\frac{18}{60}\right)^2} = 5\sqrt{4.233 - 0.09} = 5\sqrt{4.143} = 5 \times 2.036 = 10.18$$


Fig. XVI 
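
The arithmetic of the worked example can be checked mechanically. The sketch below, in Python, applies the short-method formula to the frequencies and class numbers of set C; the function name `grouped_sd` is our own and purely illustrative.

```python
from math import sqrt

def grouped_sd(frequencies, class_numbers, interval):
    """Short-method estimate: i * sqrt(sum(fx^2)/N - (sum(fx)/N)^2)."""
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, class_numbers))
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_numbers))
    return interval * sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Frequencies (f) and class numbers (x) for set C, highest interval first.
f = [1, 2, 6, 9, 12, 8, 6, 11, 5]
x = [5, 4, 3, 2, 1, 0, -1, -2, -3]

print(sum(f))                          # 60, the value of N
print(round(grouped_sd(f, x, 5), 2))   # 10.18
```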


While the calculation of the standard deviation is not
difficult some practice is required if the procedures for
working it out are to be committed to memory. The reader
should, therefore, calculate for himself the standard deviation
of the remaining sets of marks, A and B, given on
page 36.


Some statistical concepts 


If a teacher is able to estimate the means and standard 
deviations of sets of marks and to determine the shape of 
their distribution, he is equipped to carry out most of the 
basic operations involved in comparing and combining 
them. Our primary object, however, in introducing these 
topics is not to enable teachers to acquire the statistical 
knowledge and expertise required for the full and proper 
treatment of examination marks and test scores—to fulfil 
this aim would require a sizeable volume rather than a 
short chapter—but rather to help them to develop an 
appreciation of the general nature of these processes and 
of the purposes they serve. Teachers are not expected to 
become statisticians in their own right, but there is an 
increasing need for them to gain some insight into the 
ways in which statistical techniques can help to improve 
the art of examining. They may be called upon to co-
operate in the processes of moderation, for example, or to
administer standardised tests. They are entitled to expect
help and advice in these circumstances but they will be
better able to follow the instructions they receive if they 
are familiar at least with the vocabulary of statistics—even 
if they are unable or unwilling to concern themselves 
with its grammar and syntax. To this end we propose to 
comment briefly on some statistical concepts to which 
reference is frequently made in the literature dealing with 


examinations and tests. 


Correlation 


The first is that of correlation. By correlation we mean 
the relationship between any two variables. The tempera- 
ture in a room is a variable—it changes from time to time 

ranging from very cold to very warm. The level of mer- 
cury in a thermometer is another variable, and of course, 
these two are closely related: as one increases, so does the 
other. In these circumstances we describe the two variables 
as being positively correlated. If we were to take simul- 
taneous measurements of the volume of liquid in a bottle 
and in a glass during the act of pouring, we would again 
find a close relationship, but this time the two variables 
would be moving in opposite directions so to speak—as 
one increases the other is found to decrease. This is referred 
to as negative correlation. It is also possible to conceive 
of variables that are completely independent of each other, 
such as, according to one embittered recruit, ability and 
rank in the armed services. Such variables exhibit zero 
correlation. The extent of the relationship between two
variables can be measured and the result is expressed as
a coefficient of correlation on a scale ranging from -1
(perfect negative correlation) through 0 (no relationship)
to +1 (perfect positive correlation). In educational
measurement we are often interested in the extent to
which two sets of examination marks are correlated— 
high positive correlation would indicate that the children 
who had relatively high marks in one examination also 
tended to have high marks in the other—or in the relation- 
ship between academic performance and other variables, 
such as intelligence or socio-economic status. Correlations 
of 0.2 or less indicate a slight, almost negligible relationship.
For example, there is a correlation of about 0.1 between
height and educational attainment. Those of 0.5
or 0.6 indicate a moderate association; there is a correlation
of this order between height and weight, for
instance. Correlations of 0.8 and above indicate a substantial
relationship. The correlation between measured
intelligence at eleven and performance in secondary
school a few years later has been found to be about 0.8;
that between the heights of identical twins is about 0.95.
A word of warning should be offered, perhaps, concerning
the interpretation of observed correlations. They


indicate a relationship but not necessarily one of cause and
effect. For example, if one were to measure the attain- 
ments in arithmetic of all the children attending the 
schools in a given area and related this to the size of their 
feet, a substantial correlation would be found. This is 
because each of these variables is associated with chrono- 
logical age: the older a child is, the higher the level of 
his academic attainment is likely to be, and, similarly, 
the oldest children will be found to have the biggest 
feet. It is unlikely, however, to say the least, that the size 


of one’s extremities determines competence in arithmetic. 
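
For readers who wish to see how such a coefficient might be computed, the following Python sketch uses the product-moment method. This is our assumption, since the text does not say which coefficient it has in mind, and all the figures are invented for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Product-moment correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariation / spread

# Invented readings: liquid left in the bottle and collected in the glass (ml).
bottle = [750, 600, 450, 300, 150]
glass = [0, 150, 300, 450, 600]
print(round(correlation(bottle, glass), 2))   # -1.0, perfect negative correlation

# Hypothetical marks for six pupils in two examinations.
paper_1 = [35, 42, 50, 58, 61, 70]
paper_2 = [40, 38, 55, 60, 57, 72]
print(round(correlation(paper_1, paper_2), 2))
```

The second pair of lists yields a high positive coefficient, since the pupils with high marks in one paper also tend to have high marks in the other.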


Sampling and errors of measurement 


Another group of concepts with which a teacher needs 
to become familiar is that comprising the notions of 
sampling, and errors of measurement. In almost all the 
forms of measurement with which we shall be concerned 
we shall be dealing in effect with samples drawn from a 
population. A population in this context refers to objects 
and events as well as to individuals. We may speak of a 
‘population’ of marks or scores, for example, from which 
a particular sample is under review. Any set of examina- 
tion marks should, in fact, be treated as a sample drawn 
from a large population. Obviously the particular exami- 
nation we give to a group of pupils is only one of a 
large number of examinations that might have been de- 
vised—and so any marks that we consider constitute 
only a sample of all the marks that, theoretically, could 
have been obtained. It follows, therefore, that all the 
marks we deal with are subject to error. Suppose, for 
example, that we wished to discover the mean height of 
the adult male population of the British Isles. To measure 
every individual would be an impossible task. We have 
to be content therefore with an estimate of this mean 
obtained by measuring a sample of the population. Clearly, 
if our sample contained a preponderance of Welsh miners, 
we would be likely to underestimate the true mean just 
as we should be in danger of over-estimating it if we 
included too many members of the Coldstream Guards. It is 

important then to distinguish, in every form of measure- 
ment, between the ‘true’ score and the ‘obtained’ score 
which only coincide if our measurements are exhaustive 
or if the sample we use is, by accident or design, perfectly 
representative of the population from which it is drawn. 
Fortunately, although we must admit that all our 
measurements are subject to error, we can, in certain 
circumstances, discover what the size of that error is
likely to be and, therefore, we can make due allowance 
for it in interpreting the outcome. Perhaps we can convey 
a rough idea of how this may be done. Suppose, referring 
back to the attempt to discover the mean height of the 
male population, we had the time and patience to measure 
a hundred samples, each drawn at random. The means 
would vary of course, but suppose further that the mean 
of all the means turned out to be 5’ 8” and that 99 out 
of the 100 means that we obtained lay between 5’ 6”
and 5’ 10”. Common sense would surely allow us to make 
a reasonably safe prediction about further means obtained 
with the same instruments and using random samples of 
the same size. Could we not say to another person embark- 
ing on the same venture that he could rest assured that the 
true mean is highly likely to be within two inches either 
side of whatever mean he obtained? In fact, on the basis 
of our experience, we could quote the odds against his 
being unlucky in this respect since we found only one 
example in a hundred of a mean falling outside these 
limits. (This topic is discussed more fully on page 80.) 
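
The thought-experiment of drawing a hundred random samples can be imitated on a computer. The Python sketch below is only an illustration: the population is artificial (10,000 invented heights centred on 68 inches), and the sample size of 50 and the one-inch limit are arbitrary assumptions of ours.

```python
import random
from statistics import mean

def sample_means(population, sample_size, n_samples, rng):
    """Mean of each of n_samples random samples drawn from the population."""
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

rng = random.Random(1968)  # fixed seed so that the sketch is repeatable

# Artificial population of heights in inches; not real survey data.
population = [rng.gauss(68, 2.5) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=100, rng=rng)
centre = mean(means)
within = sum(1 for m in means if abs(m - centre) <= 1.0)

print(f"mean of the sample means: {centre:.2f} inches")
print(f"{within} of the 100 sample means lie within one inch of it")
```

Running the sketch with larger or smaller samples shows how the scatter of the means narrows as the sample size grows.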
It is in terms such as this that the expected errors in 
educational measurement are usually quoted. The manual 
accompanying a standardised test usually quotes the 
standard error of measurement of the test. This tells us
precisely what degree of error to expect in the scores we 
obtain from it. If the standard error is quoted as 2 
points of score this means that, in two thirds of the cases 
we encounter, the true score will lie within plus or minus
2 points of the obtained score. We may not be satisfied 
with these odds. If not, we can double or treble the standard
error and thereby lengthen the odds. The true score
will lie within the range ± twice the standard error in
about 19 cases out of 20; and within the range ± three
times the standard error in about 99 cases out of 100. 
If we allow for this latter margin of error we are thus 
very unlikely to be guilty of misinterpretation in any 
given instance. 
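
The margins just described are easily tabulated. In the Python sketch below the obtained score of 100 and the standard error of 2 points are illustrative values, and `score_band` is a name of our own choosing.

```python
def score_band(obtained, standard_error, multiple=1):
    """Range within which the true score is expected to lie."""
    margin = multiple * standard_error
    return (obtained - margin, obtained + margin)

# Odds as quoted in the text for one, two and three standard errors.
odds = {1: "about 2 cases in 3", 2: "about 19 cases in 20", 3: "about 99 cases in 100"}

for multiple, description in odds.items():
    low, high = score_band(100, 2, multiple)
    print(f"true score between {low} and {high} in {description}")
```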


Statistical significance 


Finally, the concept of statistical significance is also 
related to the notion of random sampling. Here is a typical 
statement from a report of an investigation involving 
educational measurement. ‘The mean score for the boys 
was 103, that for the girls was 99.4. The difference was 
found to be significant (p=.01).’ What is being said here
amounts to this. The author is stating that the means 
he is quoting are obtained from random samples (he was 
obviously unable to examine all boys and all girls) and 
therefore that had he chosen other samples the means 
would have had different values. But how different? He 
has undertaken calculations to discover the degree of error 
in his measurements from which he is able to estimate the 
likelihood that a difference of this size would arise by 
chance. In quoting p=.01 he is stating that the probability 
of such an occurrence is only one in a hundred. In other
words he is claiming that, if there were no true difference
between boys and girls, only about one comparison in a
hundred would show a difference as large as this by
chance alone.

We suggested at the beginning of this chapter that, 
confronted with several sets of marks, a teacher might 
wish to determine the extent to which one represented 
a higher level of performance than another. It is import- 
ant to recognise that a straightforward comparison of, for 
example, observed differences between means, does not 
necessarily entitle us to draw firm conclusions about a 
‘true’ difference. 




5 


The efficiency of measurement 


The processes that we described in the previous chapter 
are the necessary preliminaries to the task of interpreting 
the results of an examination. Before attempting this task, 
however, it would be prudent to satisfy ourselves, as far 
as this is possible, that the examination has proved 
adequate. We clearly need to know how much reliance 
can be placed on the results we have obtained before we
use them as a basis for forming judgments about our 
pupils’ progress and attainments. Thus we need to examine 
the examination itself as well as our pupils. This is an 
exercise which can reveal unsuspected flaws in its con-
struction, which we can seek to avoid when we devise a
similar examination subsequently. In this chapter we 
discuss some of the ways in which a test or examination 
may profitably be subjected to systematic scrutiny. 


Item analysis 


First of all it is useful to consider each item or question 
separately. All too frequently we tend to concentrate 
our attention on the total mark that an examination 
yields and, by doing so, it may well escape our notice that 
some of the questions have failed to satisfy our require- 
ments. Each question has presumably been included for a 
specific purpose—to contribute towards the assessment 
of some particular skill or area of knowledge. If we found,
for example, that one question had been left unanswered
by all the pupils, it would surely be advantageous to try
to discover the reasons for its unpopularity: it may well
be that our teaching has been at fault in this respect or, 
perhaps, it was obscurely worded and the pupils were for 
this reason unable to understand what was expected of 
them. In either case action is called for: we need to 
improve the next course or the next examination. 

In this regard it is useful to consider the methods em- 
ployed by those who construct standardised, objective 
tests. One of the processes involved in preparing tests of 
this kind is that of item analysis. This involves trying out 
considerably more items than will be needed in the final 
test and discovering, for each, its facility value and its 
efficiency. 

The facility value of an item indicates how easy or 
difficult it proved to be and is determined by calculating 
the percentage of children who answered it correctly. 
Thus if half the children involved answer an item correctly 
and the other half fail to do so it is said to have a facility 
of 50%. What facility values are to be regarded as accept- 
able will depend, of course, on the kind of test one is trying 
to construct and on the purposes that it is intended to 
serve. It is clear however that in most instances we are 
seeking to achieve some degree of discrimination—between 
the able and the less able or between those who have and 
those who have not profited from our instruction. It is 
unlikely therefore that we shall be content to include too 
many items with extreme facility values. Items that none 
of the pupils can answer spread alarm and despondency 
and do not serve any useful purpose. Those that all the 
pupils can answer may be justifiable as morale-boosters, 
particularly at the beginning of a test but fail to disclose 
differences in levels of ability and attainment. 
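
As a small illustration, the facility value can be computed as follows in Python. The representation of the responses (1 for a correct answer, 0 otherwise) and the data are our own assumptions.

```python
def facility(responses):
    """Percentage of pupils who answered the item correctly.

    responses: one entry per pupil, 1 if correct and 0 if not.
    """
    return 100 * sum(responses) / len(responses)

# Hypothetical responses of ten pupils to three items.
item_easy = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
item_middling = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
item_hard = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, item in [("easy", item_easy), ("middling", item_middling), ("hard", item_hard)]:
    print(name, facility(item))
```

The first and last items have the extreme facility values of 100% and 0% and so tell us nothing about differences between the pupils.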

By the efficiency of an item we mean the extent to 
which it pulls its weight, so to speak. If the purpose of a 
test is to discriminate between able and less able pupils 
an efficient item is one which demonstrably serves this 
purpose. To ensure that it is doing so one needs not only to
discover the percentage of children who answer it cor- 
rectly but also to find out what kind of children they are. 

For example, suppose we find that a particular item was
answered correctly by 60% of the children. This means
that, in terms of facility value, it has almost certainly 
proved to be acceptable. Suppose, however, that we now 
examine the performance of the pupils on the test as a 
whole and divide them into three groups on the basis of 
their total scores. If we now found that the 60% who had 
answered the item correctly were equally represented in 
these three groups—20% in the top third, 20% in the 
middle third, and 20% in the bottom third, it would be 
clear that the item was not performing a useful function. It 
has manifestly failed to distinguish between the good, bad 
and indifferent with respect to the qualities that the test 


has been designed to assess. It is therefore inefficient, in the 
sense in which we are using the term. 


[Fig. XVII. Two graphs, (a) and (b), each showing the proportion of students answering the item correctly (vertical axis) against total scores divided into sixths, 1st to 6th (horizontal axis).]


The efficiency of an item may be assessed in the way we 
have just indicated. That is to say, on the basis of their 
total scores the pupils are arranged in rank order and 
then divided into thirds, or sixths. For each item the pro- 
portion of the pupils in each sub-division who answered the 
item correctly is determined. A satisfactory estimate of 
efficiency is yielded by subtracting the proportion found 
in the lower third from that in the upper third. For 
example, in the situation described above, 60% answered 
the item correctly and of these 20% were in the top 
third and 20% in the bottom third. The efficiency would 
therefore be zero. If however the percentages had been 
differently distributed—say 40% in the top third, 15% 
in the middle third and 5% in the bottom third, the 
estimate of efficiency would now become 35. There are 
more sophisticated ways in which this characteristic of 
an item may be assessed but this simple method is satis- 
factory for most purposes. Indeed a rough check on 
efficiency may be obtained by representing the proportions 
graphically. In Fig. XVII, section (a) illustrates an 
efficient item, and section (b) an inefficient one. 
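
The simple estimate of efficiency might be sketched in Python as follows. We assume here that the proportion is reckoned within each third (the percentage of the top third who were correct, less the percentage of the bottom third who were correct); the function name and the data for nine pupils are hypothetical.

```python
def efficiency(total_scores, item_correct):
    """Percentage correct in the upper third minus percentage correct in the lower third.

    total_scores: each pupil's total score on the whole test.
    item_correct: 1 if that pupil answered the item correctly, else 0.
    """
    # Rank the pupils by total score, highest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    third = len(order) // 3
    upper, lower = order[:third], order[-third:]

    def percent_correct(group):
        return 100 * sum(item_correct[i] for i in group) / len(group)

    return percent_correct(upper) - percent_correct(lower)

# Hypothetical total scores for nine pupils and their answers to two items.
totals = [90, 85, 80, 60, 55, 50, 30, 25, 20]
good_item = [1, 1, 1, 1, 0, 1, 0, 0, 1]   # answered mainly by the high scorers
poor_item = [1, 0, 1, 1, 0, 1, 1, 0, 1]   # answered equally often in every third

print(round(efficiency(totals, good_item), 1))  # 66.7
print(efficiency(totals, poor_item))            # 0.0
```

Both items have the same facility value (six pupils in nine answered each correctly), yet only the first distinguishes the more able pupils from the less able.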


A further check has to be made if the items are of the 
multiple-choice type. This involves examining the dis- 
tribution of the choices made by the pupils. Consider for 
example an item involving five choices: A, B, C, D, E of 
which D is the correct one. If 60% of the pupils have in 
fact selected D the item would appear to be satisfactory 
in terms of facility value. It is important however to 
consider how the 40% of incorrect responses have been 
shared among the available alternatives. 

The following are two possibilities : 


(a)   A    B    C    D    E
     10   10   10   60   10   percentage of choices

(b)   A    B    C    D    E
      3    0   35   60    2   percentage of choices

The first of these possibilities shows an acceptable distribu- 
tion. The correct answer, D, has attracted 60% of the 
choices—a satisfactory facility value—and the remaining 
choices are evenly distributed among the other answers 
available. The second distribution reveals the same facility 
value and might on that account be regarded as equally 
satisfactory. It is clear, however, on a closer inspection, 
that there is something amiss. A wrong response, C, has 
attracted far too much attention—possibly because of 
some ambiguity in the wording of the item. A conscientious 
test constructor would throw this one in the waste-paper 
basket and look for a replacement. 
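
A check of this kind lends itself to a simple routine. In the Python sketch below the rule for flagging a wrong answer (that it has drawn more than half of all the incorrect choices) is an arbitrary threshold of our own and not one proposed in the text.

```python
from collections import Counter

def choice_percentages(choices, options="ABCDE"):
    """Percentage of pupils selecting each option; choices is a list such as ['D', 'C', ...]."""
    counts = Counter(choices)
    return {option: 100 * counts[option] / len(choices) for option in options}

def suspect_distractors(percentages, key, share=0.5):
    """Wrong options attracting more than `share` of all the incorrect choices."""
    wrong_total = sum(p for option, p in percentages.items() if option != key)
    if wrong_total == 0:
        return []
    return [option for option, p in percentages.items()
            if option != key and p / wrong_total > share]

# The two distributions discussed above, with D as the correct answer.
distribution_a = {"A": 10, "B": 10, "C": 10, "D": 60, "E": 10}
distribution_b = {"A": 3, "B": 0, "C": 35, "D": 60, "E": 2}

print(suspect_distractors(distribution_a, key="D"))  # []
print(suspect_distractors(distribution_b, key="D"))  # ['C']
```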

The methods of item analysis that we have described 
so far are those that are used in the construction of objec- 
tive tests. Essay-type questions present a rather more 
difficult problem in this respect, but two possibilities sug- 
gest themselves. The first is that when such questions are 
used a relatively detailed, objective scheme of marking 
might be designed. This is being done to an increasing 
extent in many external examinations. Essay-type ques- 
tions are set in these examinations but they are carefully 
phrased so as to elicit certain predictable responses. The
examiners then confer to determine the marks to be 
allotted to each specific fact, argument, etc. that a candi- 
date includes in his answer. If separate elements can be 
identified in this way and marked right or wrong (or pre- 
sent or absent) the methods of item analysis that are applied 
to objective tests can of course be used. The second possibi- 
lity is to modify these methods and to undertake a com- 
parable but somewhat cruder form of analysis. If we elect 
to frame examination questions which elicit a diffuse 
response and which we propose to mark by subjective 
impression we can nevertheless carry out an inspection 
of the following kind after the marking has been completed.
We can calculate the average mark for each question and
express it as a percentage. This quantity can then be
treated as a facility value. Thus we might choose to regard
a question which yielded average marks between, say,
30% and 70% as having an appropriate level of difficulty
and those falling outside these limits as unsatisfactory.




We are not suggesting of course that teachers could or 
should carry out an elaborate procedure of item analysis 
with regard to every question in every examination that 
they devise. All that we are urging is that the principle
underlying this type of analysis should be recognised and 
that some empirical checks should be undertaken. Those 
who construct standardised objective tests build up an 
‘item bank’—that is, a store of items that have been tried 
out, item analysed, and passed as fit for human consump- 
tion. Such items may be classified not only with reference 
to their facility and efficiency values, but also in terms 
of the kinds of objective that they serve to appraise. 
Appropriate items may be withdrawn from this bank when-
ever it is necessary to compile a test of a specific kind.
Teachers could profitably adopt this practice. They could 
carry out a selective item analysis. During the course of 
marking a set of scripts one can usually detect those items 
which the pupils have found very easy or very difficult 
or which have proved to be unsatisfactory in some other 
way. Such suspect items could be singled out for analysis. 
Similarly it is usually possible to recognise the items that 
seem to have suited one’s purpose unusually well. Some of 
these too could be more closely analysed. The net result 
of such an undertaking could amply justify the labour 
involved. One could gradually accumulate a stock of 
tried and tested questions which could be used with con- 
fidence and earlier errors of judgment could be avoided. 


Reliability 


Having carefully examined its separate components the 
next step is to satisfy ourselves that the examination as a
whole is behaving satisfactorily. There are two related
attributes that a good test or examination must possess.
It must be reliable and valid.

A reliable examination is one that yields consistent 
results. A valid examination is one that demonstrably 
measures what it was intended to measure. It is important
to be clear about the relationship between these two attributes,
which is not a reciprocal one. An examination
cannot be valid unless it is reliable; but it can be reliable 
without being valid. Suppose for example that we set out
to determine the weights of a group of people and we use
a faulty weighing machine, one that indicated weights
ranging from five to fifteen stones when the same individual
stood on it on successive occasions. There would be little 
point in recording the results: the unreliability of our 
measurements has, in this instance, made it impossible to 
obtain a valid assessment. On the other hand we might
use instead of the useless weighing machine a measuring 
rod that has been accurately marked off in inches and 
perhaps affords readings to two places of decimals. We 
would now have a highly reliable instrument, but not,
of course, a valid measure of weight.
We must therefore consider these two characteristics 
separately. We must ensure that any examination we use 
yields consistent results. But we cannot assume, if it does, 
that it is necessarily yielding the right results, in the sense
that it is enabling us to assess the particular qualities or
characteristics with which we are concerned.
Let us consider first the reliability of an examination. 
We propose to discuss the factors that affect reliability; 
the precautions that a teacher can take to ensure that 
the tests and examinations he uses are as satisfactory 
as possible in this respect; and, briefly, the ways in
which the reliability of a test or examination can be
assessed.


The content and construction of the examination itself
can affect the consistency of the scores that it yields. In
the preceding discussion of item analysis we pointed out
that this procedure can reveal inadequacies in the wording
of questions and items. It is clear that any question that
is so obscurely or ambiguously worded as to leave the
pupils in doubt as to the kind of response that is expected
will reduce the accuracy of our assessments. Similarly
an examination as a whole cannot be reliable if it
does not adequately sample the skills or knowledge under
review. It follows from this that the reliability tends to
increase in proportion to the number of items or questions 
that it contains. The extreme case—an examination con- 
sisting of one brief question—would obviously leave too 
much to chance. One could conceivably select the only 
question that a particular pupil could answer, which might 
also be the only question which another pupil could not 
answer, and so a succession of such examinations could 
yield a series of markedly different rank orders. The more 
questions that an examination contains, especially if these 
are so chosen as to ensure that each relevant aspect of the 
subject-matter is dealt with, the more reliable the results 
will become. 

Thus the kind of preparation we have outlined earlier— 
planning an examination in accordance with a carefully 
designed blue-print—together with some form of item 
analysis, will go a long way towards ensuring that the 
examination is reliable. 

The consistency of the results will also depend on the 
methods of marking that are employed. There is abundant 
evidence to show that those assessments which rely on 
subjective impression vary not only from one examiner
to another but also from one occasion to another when a
single examiner is involved. It is clear therefore that if
essay-type questions are included in an examination, the
interests of reliability will be served if a detailed, objective
scheme of marking can be devised. The ideal is to try to
organise a system of marking which would yield comparable
results if a colleague were invited to reassess the
pupils’ scripts independently.

A third source of unreliability is one that lies mainly 
outside our control. In educational measurement we are
concerned with inferences about the knowledge and skill
that children have acquired which are based on their
performance on a particular occasion. We have no guarantee,
of course, that this performance, even when we are
able accurately to assess it, does full justice to their capabilities.
The extent to which a child demonstrates effectively
the knowledge or skill that he possesses (even when
the examination affords him the fullest opportunity to do
so) depends on a variety of personal factors. His health,
emotional state, mood, and motivation will all affect the 
outcome. There is little that a teacher can do to guard 
against this potential source of unreliability. He can of 
course arrange for an examination to be given at the 
beginning of the day when the children are less likely 
to be tired; he may also be able to arouse motivation and, 
if he has established good relations with his pupils, he may 
successfully allay any anxiety that they might otherwise 
feel. He cannot be expected, however, to control or even
to recognise all the personal factors that may influence the results.

It is this latter group of factors which makes it difficult 
to assess the reliability of educational measurements. An 
obvious method of doing so would seem to be to repeat 
them, after a short interval, and to check by the process of 
correlation to which we referred in the previous chapter 
that the results are consistent. In dealing with physical 
measurements this would be an acceptable procedure. We 
could be satisfied with the reliability of a weighing 
machine, for example, if we found that it recorded the 
same result when the same object was weighed on succes- 
sive occasions. There is the possibility of human error in 
such measurements of course—we could misread the dial 
for instance—but, with care, this can be reduced to neg- 
ligible proportions and any persistent error that we detect 
can be justifiably attributed to the unreliability of the in- 
strument. If, however, we administer an examination twice 
to a group of pupils we are not dealing with a comparable 
situation. We cannot ascribe all the differences that 
emerge to defects in the examination itself or to the errors 
of marking. Some of them may reflect changes in the
pupils we are assessing. Even if only a short interval
separates the two occasions it is possible that some
of the pupils will have increased their knowledge, or developed
their skills, so as to justify a genuine improvement
in their performance. It is also possible that experience
of the first examination will help them to acquit themselves
more creditably on the second occasion. ‘Practice effect’ is
a frequently encountered phenomenon in educational measurement.
Since we cannot hope to disentangle the changes
due to these factors from those which reflect the imper- 
fections of the examination this method of estimating its 
reliability is not very satisfactory. An alternative approach 
is to use for this purpose two parallel tests or examinations. 
By this means the effects of practice can be reduced and 
one can more directly ascertain the shortcomings of the
tests themselves. It is difficult, however, to devise two tests 
or examinations that are strictly parallel. To make sure 
that they measure the same aspects of knowledge or skill 
and, more particularly, that an item in one test corresponds 
exactly, in terms of facility, efficiency etc., with its counter- 
part in another, would require unusually careful construc- 
tion and analysis. For these reasons estimates of reliability 
have been sought which can be derived from the results 
of a single test or examination. 

One method of achieving such an estimate is by dividing 
the test into two parts and discovering the correlation be- 
tween them. This is sometimes referred to as the ‘split-half’ 
method. Suppose for example one wished to test a home- 
made twelve-inch ruler. One might suspect that here and 
there the divisions had been incorrectly marked and that 
therefore some ‘inches’ were longer than others. This could 
obviously be checked by breaking the ruler into two parts 
and measuring a series of lines with each half in turn. The 
higher the correlation between the two sets of measure- 
ments, the more satisfied we could feel that our suspicions 
were groundless, Applying this principle to a test or exam- 
ination one has to ensure first of all that the two halves 
are comparable. In measuring the reliability of objective 
tests this is usually done by assigning the odd items to one 
half and the even items to the other. Secondly it is necessary
to allow for the fact that the correlation one obtains
from this comparison is based on tests that are shorter
than the original. As we pointed out above the reliability
of a test depends in part on its length: other things being
equal the longer a test the higher its level of reliability.
Thus the split-half method would yield an underestimate
of reliability unless we employ a formula which compensates
for the fact that we are examining a test which is
just half as long as the original. Such a formula has been 
devised and is referred to as the ‘Spearman-Brown’ formula. 


According to this the reliability of a test is $\frac{2r}{1+r}$, where r is


the measured correlation between the two halves of the 
test. 
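
The split-half procedure, with the Spearman-Brown correction for a test of double length, might be sketched in Python as follows. The use of the product-moment coefficient for the correlation between the halves is our assumption, and the item marks for five pupils are invented.

```python
from math import sqrt

def correlation(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_brown(r):
    """Estimated reliability of the whole test from the half-test correlation r."""
    return 2 * r / (1 + r)

def split_half_reliability(item_marks):
    """item_marks: one list per pupil of 0/1 marks, in the order the items were set."""
    odd_half = [sum(row[0::2]) for row in item_marks]    # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_marks]   # items 2, 4, 6, ...
    return spearman_brown(correlation(odd_half, even_half))

# Invented marks for five pupils on a six-item test.
item_marks = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(spearman_brown(0.6), 2))                  # 0.75
print(round(split_half_reliability(item_marks), 2))
```

A correlation of 0.6 between the halves thus corresponds to an estimated reliability of 0.75 for the full test, which illustrates the extent to which the uncorrected figure would understate it.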

In this connection it should also be noted that whereas 
a test or examination as a whole may be demonstrably 
reliable, it does not follow that we can treat sub-totals— 
for example the marks allotted to different parts or sections 
of it—as being equally reliable. We may be justified 
in comparing the marks of individuals on the test as a 
whole, but we ought to be more chary of basing such 
comparisons on their performance in certain sections 
of the test. It is clear from the preceding discussion that 
in the latter case we are dealing with a lower level of reli- 
ability. 

There are other, more sophisticated ways of assessing 
reliability—the ‘Kuder-Richardson’ method for example
which bases the estimate on the kind of evidence which is 
yielded by item analysis—but we are not here primarily 
concerned with techniques for measuring reliability. (The 
text-books to which we refer provide appropriate guidance 
to any teacher who wishes to undertake such measurements.)
Our major concern in this regard is to emphasise the need
to ensure that any tests or examinations that we use are as
reliable as possible and to recognise, when interpreting the
results, that perfect reliability is unattainable.


Validity 


We have defined the validity of a test or examination as
the extent to which it serves the purpose for which it has
been designed, and we have pointed out that it cannot do
so unless it is also a reliable instrument. It is clear from
this definition that since there are various purposes for
which examinations can be employed there are several
distinctive kinds of validity. A particular examination may 
be adequate for one purpose but totally unsatisfactory if 
required to serve another. Thus for example we might 
devise an examination which everyone agrees provides 
a valid assessment of the history we have tried to teach 
to a group of primary school pupils. But if we use this 
examination as a means of predicting their success in a 
secondary school course, we may find that it is less satis- 
factory for this purpose than other measures—intelligence 
tests or tests of attainment in English, for example. This ex- 
ample enables us to distinguish between two of the major 
kinds of validity with which teachers are concerned—con- 
tent validity and predictive validity. By content validity 
we mean the extent to which an examination adequately
samples the area of knowledge or skill with which it is 
concerned. Validity in this sense can best be ensured by 
following the processes that we have recommended in 
earlier chapters—basing the construction of tests and 
examinations on a detailed blue-print so that it can be 
measured by means of rational analysis. To assess the 
content validity of an examination one needs to relate it to 
the course of instruction that it is designed to appraise. 
What were the aims of the course? Do the questions in the 
examination afford a serviceable means of determining 
the extent to which these aims have been realised? What 
were the contents of the syllabus? Does the examination 
adequately sample these contents? Do the syllabus and the 
examination correspond in terms of priorities, distribution 
of emphasis? These are the kinds of question that are
involved in the assessment of content validity. Obviously we
should try to answer them ourselves. Equally obviously it
would be advantageous to secure a second opinion from
colleagues who are qualified to judge.

By predictive validity we mean the extent to which an 
examination provides a satisfactory forecast of, for
example, our pupils’ progress and attainments. We may wish
to use tests and examinations in order to allocate pupils
to suitable courses, classes, streams or sets. This kind
of validity calls for empirical assessment. To determine the
extent to which our measurements have served their
purpose we would need to relate them to some criterion. The
validity of eleven-plus procedures has been extensively in- 
vestigated, for example, by comparing the results of the 
tests with the subsequent performance of the pupils in 
their secondary school courses. By the same token any 
test or examination that is used for predictive purposes 
within a school should be checked by some comparable 
form of follow-up study. 
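
A follow-up study of this kind usually comes down to correlating the original test scores with the later criterion. The sketch below assumes invented figures for six pupils and a hypothetical function name; it is meant only to show the shape of the calculation, not a prescribed procedure.

```python
from statistics import mean


def validity_coefficient(test_scores, criterion_scores):
    """Correlation between a predictor test and a later criterion."""
    mx, my = mean(test_scores), mean(criterion_scores)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mx) ** 2 for x in test_scores)
    syy = sum((y - my) ** 2 for y in criterion_scores)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: selection-test scores and the same pupils'
# examination marks two years later.
selection_test = [95, 102, 110, 118, 125, 131]
later_marks = [48, 55, 52, 63, 70, 68]

coefficient = validity_coefficient(selection_test, later_marks)
print(round(coefficient, 2))
```

The closer the coefficient is to 1, the better the test has forecast the criterion for this group; a value near zero would suggest that the test is of little use for the predictive purpose, however satisfactory its content.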

Other types of validity are frequently mentioned in the 
literature of educational measurement. These refer to ad- 
ditional ways in which the effectiveness of a test may be 
assessed. Congruent validity refers to evidence obtained 
by correlating a test with another established instrument 
for measuring a particular kind of performance. Thus if 
someone wished to construct a test of, for example, spatial 
ability, the most convenient way of validating the new 
test might be to compare its results with those provided 
by a standardised test of this same ability whose validity 
had been determined by other means. Concurrent validity 
refers to a comparison between a test and some other form 
of assessment carried out at the same time. Some of the
early intelligence tests for example were regarded as ade- 
quate instruments because they were found to correlate 
highly with teachers’ assessments of their pupils’ intelli- 
gence—assessments based on day-to-day observations of 
their academic progress and problem-solving abilities.
Finally we would refer to face validity which is a term used
to indicate the acceptability of a test. However efficiently
a test might be constructed, it cannot usefully serve if the
consumers, so to speak, are not convinced of its adequacy.
Indeed it cannot be effective if pupils find it
difficult to follow the necessary instructions—if for
instance the spaces in which they are required to write
their responses are too small or are difficult to locate.
Nor can it be effective if teachers find it difficult to [...]

As a general conclusion to this section we would
emphasise that in choosing tests for use in the classroom a
teacher should look for evidence of their reliability and 
validity and that in constructing his own tests and exam- 
inations he should keep in mind the factors that influence 
these two important attributes. 




6 


Expressing the results 


Having devised and administered an examination and satis- 
fied ourselves that it is a valid and reliable instrument, we 
have to choose some suitable form in which to express 
the results. It is generally accepted that the ‘raw’ marks, as 
they are often called, are meaningless except, perhaps, to 
the person who has assigned them. Any number of marks 
may be allotted, quite arbitrarily, to each question or
section in an examination; the totals, therefore, convey very
little information about the quality of an individual’s
performance.


Limitations of percentages 


A widely adopted practice, which is intended to make 
marks more meaningful, is to convert them into percentages.
There is something solid, respectable and seemingly
definite about percentages. Parents undoubtedly prefer
them to any other form of communication. Face-to-face
discussions or written reports leave the average parent
uncomfortably uncertain about his child’s progress. Say
‘English, 60%’ and he feels that at last he has been given a
clear, unambiguous indication of his child’s level of
attainment. Unfortunately he has not necessarily been given
anything of the sort. Marks expressed in this form float
uneasily in a kind of limbo. 60% would mean something,
perhaps, if we could define precisely what 0% and 100%
would convey. If the former indicated—which is highly
unlikely—a total lack of acquaintance with the English
language and the latter—which is even more unlikely—
a state of unsullied perfection, an intermediate mark might 
indicate the amount of progress a child had made. Parents 
and children are rarely perceptive enough to ask the per- 
tinent question: 60% of what? The honest answer to this 
question is of course that we are quoting percentages of 
the total mark that we were prepared to give in this parti- 
cular examination, which could easily have been made 
much more or much less difficult. Had we chosen to make 
the examination easier (or if an alternative form had for- 
tuitously turned out to be easier as far as this particular 
child is concerned) the 60% might have become 90% or 
more. Conversely, if we had been in a sour mood when we 
drafted questions, the child might have earned no more 
than 10%. 

Perhaps there is little harm in allowing parents and 
children to assume that percentage marks have an absolute 
value. It can be dangerously misleading however to permit 
them to form judgments on the basis of comparisons be- 
tween percentages. If we announce that a child has been 
awarded 60% in English and 75% in mathematics we are 
inviting conclusions to be drawn and decisions to be made 
which could well affect his subsequent educational career. 
In these circumstances a child and his parents may assume 
that he is ‘better’ at mathematics than at English. As we 
have seen, examination marks expressed in this form
cannot support such an inference. In Chapter 2 we discussed
the different levels of measurement that can be employed
and the treatment that can be appropriately applied to
each. The full range of arithmetical operations that we
may wish to perform on examination results in order to
combine or compare them is admissible only when we are 
dealing with a ratio scale—a scale, that is, with equal inter- 
vals and an absolute zero. Examinations of the kind we are 
discussing do not satisfy these conditions. 

If the scales we use cannot be tethered at either extreme 
how can we make sense of the marks we accord to an 
individual’s performance? One solution is to look for some 
alternative reference point. We cannot satisfactorily
identify or define the minimum or maximum, but we can
determine the mid-point or average—and this can be done
empirically and objectively. For any group of children one 
can ascertain the average performance and then proceed 
to relate each individual’s performance to this. Expressed 
in this way marks at once become more meaningful. To dis- 
cover that a child has x marks more than some undefined 
zero level tells us little or nothing about his performance; 
but to learn that he has x marks more or less than the aver- 
age for some group reveals his relative status within that 
group. 


Mental ages 


One of the first attempts to express scores in this way— 
to relate them, that is, to established norms—was under- 
taken by Alfred Binet, the pioneer of intelligence tests, 
who introduced the concept of mental ages. Binet tried 
out items or questions on large numbers of school children 
of different chronological ages. By so doing he was able 
to identify the questions that were appropriate for a par- 
ticular year group. Thus he was eventually able to devise a 
set of questions that could be tackled successfully by eight- 
year-olds, but not seven-year-olds, another that could be 
done by nine-year-olds but not by the younger age-groups, 
and so on. He was then able to identify the relative level 
of any particular child’s performance. If a nine-year-old 
child, for example, could successfully deal with the test that 
was, in general, appropriate for eleven-year-olds he was 
said to have a ‘mental age’ of eleven. Another nine-year- 
old who could not get further than the seven-year-old level 
was assigned a mental age of seven. Other age-norms
were later introduced into educational measurement—
reading ages, arithmetic ages, etc. They were determined
in the same way, by discovering the level of a child’s
attainments in terms of the chronological age for which
it was found to be appropriate.

Scores expressed in this form are clearly more meaningful
than raw totals or percentages but may nevertheless
be misleading in some respects. Consider, for example, the
following two children. One is a bright six-year-old who is 
able in a Binet-type test to put up a performance equivalent 
to the average for eight-year-olds. He is accordingly 
awarded a mental age of eight which, in a sense, is an ade- 
quate indication of his intellectual status. The other is a 
relatively dull ten-year-old who also performs at the 
eight-year-old level and earns a mental age of eight. On 
the face of it these two children might be regarded as 
equals. Might they not then be regarded as equals for educa- 
tional purposes? Should they be put in the same class or 
group? As soon as the question is posed in this form its 
absurdity is at once recognisable. These two children are 
clearly not candidates for the same kind of educational 
treatment. Apart from the fact that they are likely to 
differ markedly in terms of social and emotional maturity, 
their apparent comparability with respect to their intellec- 
tual status is illusory. If one were to plot the course of their 
intellectual growth it would be obvious that the first of 
these children is developing at a rapid rate, the second 
much more slowly. The first is probably somewhere 
near the beginning of a steeply rising curve; the second may 
be approaching the point at which his rate of intellectual 
growth will decelerate markedly. Thus it would be folly 
to regard them as in any way comparable scholastically. 
It is purely fortuitous that their paths have crossed, so to 
speak, at this particular moment in time.

This and other short-comings involved in the use of men- 
tal and other ‘ages’ to represent performance in tests led to 
the introduction of intelligence and attainment quotients. 
This involved relating mental age and chronological age 
(and it became the convention to multiply this ratio by a 
hundred). The difficulty to which we have just referred is 
clearly obviated by this means. When the performances of 
the children in the example discussed above are expressed 
in terms of intelligence quotients instead of mental ages
they are seen to be markedly disparate. Although they have 
the same mental age, the intelligence quotient of the first 
child is approximately 133 and that of the second only 80. 



Incidentally it is clear from this illustration that mental 
age reveals an individual’s relative status at a given point 
in time, whereas intelligence quotients provide an indica- 
tion of the rate of intelligence growth. 
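
The arithmetic behind the two quotients quoted above is simply mental age divided by chronological age, multiplied by a hundred. A minimal sketch (the function name is our own):

```python
def intelligence_quotient(mental_age, chronological_age):
    """Ratio quotient: 100 x mental age / chronological age."""
    return 100 * mental_age / chronological_age


# The two children discussed in the text, both with a mental age of eight.
bright_six_year_old = intelligence_quotient(8, 6)
dull_ten_year_old = intelligence_quotient(8, 10)

print(round(bright_six_year_old))  # 133
print(round(dull_ten_year_old))    # 80
```

The same mental age therefore yields very different quotients once chronological age is taken into account.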

Although scores that are expressed as quotients are
more useful for many purposes than those in the form 
of mental or attainment ages they have certain disadvan- 
tages. One of these is that tests of the Binet type—those 
based on age norms—do not manifest a consistent standard 
deviation throughout the age range. In other words, the 
spread of the scores or the extent to which they tend to 
depart from the mean varies from one chronological age 
level to another. This means that intelligence quotients, 
for example, that are obtained at different ages are not 
strictly comparable: two dissimilar scores may relate to 
what are in effect equivalent levels of performance or, al- 
ternatively, identical scores may mask a genuine change 
in relative status. 

A further difficulty is encountered when one wishes to 
gauge the performance of adolescents and adults. Mental 
age scores, as derived from the kinds of tests we have been 
considering, tend not to increase after late adolescence and, 
consequently, since chronological age continues of course 
to increase at a steady rate, performance measured in
intelligence quotients shows, from the early twenties onwards,
a progressive approach towards imbecility. Because
age-norms have these—and other—disadvantages it has become
the practice to use some other form of score which relates
an individual's ability or attainments to those of a specified 
group of people; this may be a particular age-group: for 
example, there are tests which have been applied to large 
representative samples of children within a prescribed age- 
range—from ten to eleven and a half, for instance. A 
score on such a test indicates where a child is to be located, 
as it were, along the continuum of performance manifested
by that group. Age of course need not be the criterion, or
the only criterion, for selecting the reference group. There
are tests which have been standardised on selected
populations such as grammar school sixth-formers or university
undergraduates. A score on a test of this kind enables one
to compare an individual with members of this high- 
ability group rather than with the population at large. 


Percentile ranks 


Scores which serve to identify a person’s status within a 
specified group may be expressed in a variety of forms.
One convenient way of indicating the level of an indivi- 
dual’s performance is to quote his percentile rank. This 
tells us what percentage of the group performed at a lower 
level. Thus if we compare a person’s mark with those ob- 
tained by the group as a whole and find that, when these 
marks are arranged in rank order, he is exactly half way
down the list, he would be said to be at the 50th percentile.
In other words 50% of the group were below him in the
list. If he fared better than 90% of the group he would
have a percentile rank of 90, and so on.
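
As a brief illustration, a percentile rank in the sense just defined (the percentage of the group performing at a lower level) can be computed as follows. The group of marks is invented, and the treatment of tied marks is an assumption on our part: other conventions count half of the tied cases as falling below.

```python
def percentile_rank(mark, group_marks):
    """Percentage of the reference group scoring below the given mark."""
    below = sum(1 for m in group_marks if m < mark)
    return 100 * below / len(group_marks)


# Hypothetical reference group of 100 pupils with marks 1 to 100.
group = list(range(1, 101))

print(percentile_rank(51, group))  # 50.0
print(percentile_rank(91, group))  # 90.0
```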

It is clear that this is a much more meaningful
representation of an individual’s performance than his total mark or
percentage mark in an examination. This is especially so if 
the reference group is a large and representative one. To be 
told that a child is at the 95th percentile in a test adminis- 
tered to a representative sample of his age group leaves 
us in no doubt about the quality of his performance. 

On the other hand it should be recognised that by trans- 
forming marks into percentile ranks we have not produced 
an interval scale. In other words the intervals between
percentile ranks are not equal throughout the range. The
distance between the 50th and the 55th percentiles, for
example, is likely to be much smaller than that between
the 90th and 95th because, in any test or examination,
there is a tendency for more people to congregate around
the mid-point of the scale than at either extreme. Thus
it might be possible to move up from the 50th to the 55th
percentile just by getting an extra mark or two, whereas
moving from the 90th to the 95th would entail much
greater effort. Since we are not dealing with an interval
scale we can, as we saw in Chapter 2, perform only a
limited number of operations on the results. For example
we cannot legitimately combine percentile ranks or aver- 
age them. 
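
The inequality of the intervals between percentile ranks can be made concrete if we assume, purely for illustration, that the underlying marks are normally distributed. The sketch below uses Python's standard `statistics.NormalDist` to express each percentile as a distance from the mean in standard deviation units.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1


def gap_in_sd_units(lower_percentile, upper_percentile):
    """Distance between two percentiles, in standard deviations,
    on the assumption of a normal distribution."""
    lower = unit_normal.inv_cdf(lower_percentile / 100)
    upper = unit_normal.inv_cdf(upper_percentile / 100)
    return upper - lower


middle_gap = gap_in_sd_units(50, 55)  # roughly 0.13 of a standard deviation
upper_gap = gap_in_sd_units(90, 95)   # roughly 0.36 of a standard deviation

print(round(middle_gap, 2), round(upper_gap, 2))
```

On this assumption the step from the 90th to the 95th percentile is nearly three times as large as the step from the 50th to the 55th, which is why percentile ranks cannot properly be added or averaged.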

Clearly, if we wish to treat results in this way we need to 
raise the level of measurement we employ. We would again 
emphasise that the scores we have so far discussed—in- 
cluding those expressed as percentile ranks—are, in effect, 
ordinal scales. We may represent the performances of
individuals in quantitative terms but since we have not
arranged for the intervals between the scores to be equal
in size we are dealing with what is virtually an order 
of merit. Thus when we use percentile ranks we can state 
how many people scored more or less than a particular
individual which makes our statement more significant 
than the quotation of a total mark or percentage alone, but 
we cannot determine the extent to which one individual's 
performance is better or worse than another's. To do so 
we need an interval scale. That is to say we need to
translate our original or ‘raw’ marks to a scale which has more
or less equal units.


Standard scores 


To explain how this can be achieved we must first of all 
refer back to our earlier discussion of the ways in which 
one set of marks may be distinguished from another. We 
suggested that it was necessary to discover three attributes: 
a measure of central tendency—usually the mean; a
measure of the ‘spread’ of the marks—for which the standard
deviation is the most serviceable; and the shape of the 
distribution. These three distinguishing features adequately 
characterise any set of marks we have to deal with. We 
also pointed out that, in educational measurement, a
commonly encountered form of distribution was the
normal curve. It is by making use of these three attributes
—mean, standard deviation, and distribution in the form
of a normal curve—that an interval scale can be devised.
The procedure that has been widely adopted in the
construction and standardisation of objective tests is to
arrange for the scores that are eventually used—as
distinct from the ‘raw’ marks—to be distributed in
accordance with the normal curve; to use the mean score as the
reference point; and to express all the scores in terms 
of standard deviations or fractions of a standard deviation 
above or below the mean. These are known as standard 
scores. Standard scores take various forms but before 
discussing these in any further detail it is necessary to 
consider why the normal curve plays such a prominent 
part in these arrangements. 

A normal curve of distribution—or the curve of error 
as it is sometimes called—tends to be found when a 
variable is subjected to a number of randomly operating 
influences. For example, if we drew a line on a black- 
board, invited a large number of people to estimate its 
length, and recorded these estimates, we would find that 
they were normally distributed. Some would over- 
estimate the length, others would under-estimate it; 
furthermore most of the estimates would be fairly close 
to the measured length and relatively few of them wide 
of the mark. This is the form of distribution that we find 
too when we record the means of a large number of dif- 
ferent samples drawn from a population; most of these 
means will cluster about the true mean—that is, the 
mean of the population as a whole—but a few will be 
very much greater and a few very much smaller than this. 

A number of human attributes have been found to be 
normally distributed—because, presumably, they are 
determined by random genetic influences. Height, for 
example, is distributed in this form, but not weight, which 
tends to be affected by more systematic influences. In 
educational measurement we frequently encounter this 
type of distribution—not because we have discovered that 
abilities, aptitudes and attainments are necessarily
distributed in this way (although everyday experience
suggests that in most fields of human endeavour something
like a normal distribution is to be expected) but because
we take steps to ensure that marks and scores are
distributed in this fashion. And we do this because we are
familiar with the properties of the normal curve and can
make use of them to produce scales with approximately
equal units.

We show below a normal curve divided up into
areas limited by units of standard deviation. The advant- 
ages of using this form of distribution are readily recog- 
nisable. When scores are distributed in this form we 
can calculate the proportion of cases that is to be found 
between any two points on the scale. Thus between the 
mean and one standard deviation above the mean we can 
expect to find 34.1% of the individuals concerned, a 
further 13.6% between plus one and plus two standard 
deviations above the mean, and so on. In interpreting 
scores expressed in this form it is useful to remember
that approximately two-thirds of the cases lie between
plus and minus one standard deviation from the mean,
95% between plus and minus two standard deviations
and virtually all (99.7%) between plus and minus three
standard deviations.


[Figure: percent of cases under portions of the normal curve,
with the equivalent points on several score scales]

z scores               -3    -2    -1     0    +1    +2    +3
Cumulative %          0.1   2.3  15.9  50.0  84.1  97.7  99.9
Standardised scores    55    70    85   100   115   130   145
T scores               20    30    40    50    60    70    80

Percentile equivalents marked: 1, 5, 10, 20, 30, 40, 50, 60,
70, 80, 90, 95, 99

Stanines                1    2    3    4    5    6    7    8    9
% in each stanine       4    7   12   17   20   17   12    7    4

Fig. XVIII
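
The proportions quoted for the normal curve can be checked with a short calculation. The sketch below uses the standard library's `statistics.NormalDist`; the helper name is our own.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1


def percent_between(z_low, z_high):
    """Percentage of cases lying between two z scores."""
    return 100 * (unit_normal.cdf(z_high) - unit_normal.cdf(z_low))


mean_to_plus_one = percent_between(0, 1)     # about 34.1
plus_one_to_plus_two = percent_between(1, 2) # about 13.6
within_one = percent_between(-1, 1)          # about 68, i.e. two-thirds
within_two = percent_between(-2, 2)          # about 95
within_three = percent_between(-3, 3)        # about 99.7

print(round(mean_to_plus_one, 1), round(plus_one_to_plus_two, 1))
```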




The diagram (Fig. XVIII) also illustrates the relationship
between some of the varieties of standard scores that are
in common use. Clearly if we are expressing scores in
terms of standard deviations or fractions of a standard
deviation from the mean, the quantities we assign to these
measures are purely a matter of convenience. Along the
base-line of the normal curve we indicate the mean as
0 and the standard deviations as units +1, +2, -1, -2,
etc. Such scores are commonly referred to as z scores.
The next line shows the cumulative percentage of cases
falling beneath each part of the curve: thus starting at
the lower end of the scale, by the time we reach minus
one standard deviation we have accounted for 15.9% of
the cases, by the time we reach the mean, 50%, and so
on. Then we show the positions in relation to this curve
of the percentile ranks. (Incidentally it can be clearly
seen that, as we pointed out earlier, the intervals between
percentile ranks are not equal.) The remaining lines
illustrate some other conventional ways of expressing standard
scores. Most of the standardised tests that teachers
encounter adopt the practice of assigning a value of 100
to the mean and 15 to the standard deviation. Again
the significance of such scores can be seen by relating
them to the curve. Thus we can say of any child who
scores 115 or above on such a test that he belongs to the
top 16% (approximately) of the age-group on which the
test has been standardised. Some tests are expressed as
T scores. These have a mean of 50 and a standard deviation
of 10. The child who obtains 115 on a test scored in
standardised scores would be accorded a score of 60 on a
T score test. Finally, we occasionally encounter scores
quoted in stanines. These have a mean of 5 and a standard
deviation of just under 2, dividing the range into nine
divisions as shown.
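
The relationships among these scales can be sketched as simple conversions from a z score. Two assumptions should be noted: the raw marks are taken to be roughly normal already (published tests normalise the distribution first), and the stanine rule used here, bands half a standard deviation wide centred on the mean, is the usual convention rather than something spelled out in the text. All function names are our own.

```python
import math


def z_score(raw_mark, group_mean, group_sd):
    """Distance of a mark from the group mean in standard deviations."""
    return (raw_mark - group_mean) / group_sd


def standardised_score(z):
    """Scale with mean 100 and standard deviation 15."""
    return 100 + 15 * z


def t_score(z):
    """Scale with mean 50 and standard deviation 10."""
    return 50 + 10 * z


def stanine(z):
    """Nine bands, each half a standard deviation wide (assumed rule)."""
    band = math.floor(2 * z + 5.5)
    return max(1, min(9, band))


# A hypothetical child one standard deviation above the mean.
z = z_score(62, group_mean=50, group_sd=12)
print(standardised_score(z), t_score(z), stanine(z))  # 115.0 60.0 7
```

The child who scores 115 on the standardised scale is thus the same child who scores 60 on the T scale, as stated above.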

Reference to the properties of the normal curve of
distribution also enables us to appreciate rather more
clearly the significance of the standard error of a
measurement which we discussed in general terms in an earlier
chapter (see page 54). The score actually obtained by an
individual in a test is in effect only one of a large number
of possible scores that could be obtained if the test were 
repeated many times without memory or practice having 
any effect. In these circumstances, we may assume that 
these scores, being subject to errors of various kinds would 
be normally distributed and that the standard error is in 
fact an estimate of the standard deviation of this distribu- 
tion. We can now see why we usually double the standard 
error in order to determine the range within which the 
true score is likely to be found. The obtained score plus 
or minus the standard error would occur in approximately 
two thirds of the testing occasions. As we can see from 
the diagram (Fig. XVIII), however, if we double the
standard error we are including approximately 95% of 
these hypothetical occasions. In other words, there is only 
one chance in twenty that the ‘true’ score lies outside 
this range. For example, if a test is shown to have a 
standard error of measurement of 3 points, then we can 
be reasonably certain that the ‘true’ score of a child 
obtaining 110 lies between the limits 110 ± 2 × S.E.M.,
that is between the limits 104–116.
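
The range quoted in this example follows directly from the obtained score and the standard error of measurement. A minimal sketch, with a function name of our own choosing:

```python
def true_score_limits(obtained_score, standard_error, multiplier=2):
    """Limits within which the 'true' score is likely to lie.

    A multiplier of 1 covers roughly two-thirds of hypothetical
    retests; a multiplier of 2 covers roughly 95% of them.
    """
    margin = multiplier * standard_error
    return obtained_score - margin, obtained_score + margin


print(true_score_limits(110, 3))                # (104, 116)
print(true_score_limits(110, 3, multiplier=1))  # (107, 113)
```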

The account we have provided of the ways in which 
the scores of standardised tests are expressed is intended 
to illustrate the problems involved in translating raw 
marks into a meaningful form and to provide teachers with 
some understanding of the terms that are used in tech- 
nical discussions of tests and examinations. It may also 
serve two further purposes. We have tried to make clear 
the limitations of the marks that are yielded by the 
tests and examinations that a teacher sets in the
classroom. We have pointed out, for example, that they cannot
legitimately be compared or combined in their raw state. We
have also indicated that any single mark may be regarded
as only a sample of all the possible—and different—marks
that an individual might earn if the same examination,
or comparable forms of it, were administered on a
number of occasions. Thus there is an element of error
attached to every measure which we should allow for in
any interpretation we undertake. Furthermore in seeking
to interpret the mark obtained by a child we should bear
in mind, not only that we are dealing with a rough and 
tentative estimate of his performance but that the perform- 
ance itself is dependent on a variety of influences—his 
mood and level of motivation at the time, for example, and 
his experiences prior to the examination. In other words 
an examination mark—however carefully the examination 
may have been devised—is only one of a number of items 
of evidence that must be considered in reaching an 
appraisal of a child’s progress and attainments. 

A further purpose that this chapter is intended to serve
is to provide some insight into the reasons why teachers 
may be required to perform various operations on their 
examination marks or to produce them in a prescribed 
form. They may, for example, be invited to translate them 
into percentile ranks which, as we have seen, helps to 
offset the disadvantages of having to deal with sets of 
marks that may differ in mean and spread. Percentile 
ranks, however, suffer from the disadvantage already 
discussed that they cannot be added or averaged. Hence 
the possibility of transferring examination marks to a 
scale with normally distributed scores—such as T-scores 
—must also be considered. Alternatively teachers may be 
required to ensure that their marks conform, approxi- 
mately at least, to the normal curve of distribution. This 
can be done by determining the proportion of pupils to
be accorded marks within a specified range. Fig. XIX
illustrates the way in which a normal distribution could be
produced.

Examination mark    Percentage of candidates
    95 – 100                  1
    85 – 94                   3
    75 – 84                   7
    65 – 74                  12
    55 – 64                  17
    45 – 54                  20
    35 – 44                  17
    25 – 34                  12
    15 – 24                   7
     5 – 14                   3
     0 – 4                    1

Fig. XIX
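
One way of applying the quotas in Fig. XIX is to rank the pupils on their raw marks and allot mark bands from the top downwards. The sketch below is an illustration only: the pupils and raw marks are invented, the function name is our own, and tied raw marks are separated arbitrarily by the sort, which a teacher would wish to handle with more care.

```python
# Mark bands and percentage quotas as read from Fig. XIX, highest first.
QUOTAS = [
    ((95, 100), 1), ((85, 94), 3), ((75, 84), 7), ((65, 74), 12),
    ((55, 64), 17), ((45, 54), 20), ((35, 44), 17), ((25, 34), 12),
    ((15, 24), 7), ((5, 14), 3), ((0, 4), 1),
]


def allocate_bands(raw_marks):
    """Map each pupil to a mark band according to rank and the quotas.

    raw_marks: dict of pupil name -> raw mark.
    """
    ranked = sorted(raw_marks, key=raw_marks.get, reverse=True)
    total = len(ranked)
    bands = {}
    cumulative_percent = 0
    start = 0
    for band, percent in QUOTAS:
        cumulative_percent += percent
        end = round(total * cumulative_percent / 100)
        for pupil in ranked[start:end]:
            bands[pupil] = band
        start = end
    return bands


# Hypothetical year group of 100 pupils with distinct raw marks.
raw = {f"pupil{i}": i for i in range(100)}
bands = allocate_bands(raw)
print(bands["pupil99"], bands["pupil50"], bands["pupil0"])
```

With small classes the quotas can only be met approximately, since a band calling for 1% of the candidates may receive no pupil at all.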

If requirements of this kind are fulfilled each set of 
examination results will virtually be on the same common 
scale and the marks may therefore be justifiably compared 
or combined. The observant reader will notice that the
percentages of candidates in the scale given above
correspond to the percentages on the stanine scale given in
Fig. XVIII, the two extreme bands at each end together
matching the first and last stanines. It does not fall within
the scope of this introduction to furnish details of the
various methods that can be
employed for these purposes: we are not so much con- 
cerned with the exercise itself as with clarifying its 
objective. Easy-to-follow instructions for translating raw 
marks into a more meaningful form will be found in the
books quoted in the suggestions for further reading for
this chapter.




7 


Varieties of measurement 


What to measure 


Educational measurement, by our adopted definition, 
involves assigning numerals to objects and events in 
accordance with specified rules. We have discussed the
rules governing the design and application of measuring 
instruments and the ways in which the numerals used to 
express the results that these instruments yield may be 
meaningfully interpreted; it is to the objects and events 
with which educational measurement is concerned that 
we must now give closer consideration.

If we examine a catalogue of published tests, designed 
for use in schools, we find a wide variety of ostensibly 
different kinds: as well as tests of achievement, at all 
levels, in most of the subjects in the curriculum, there 
are tests purporting to measure intelligence, abilities, 
aptitudes, interests, attitudes and many other attributes. 
Furthermore, within each of these—and other—categories 
there is a range of tests, each, it would seem, designed 
to examine a specific component or aspect of the general 
characteristics concerned. This would seem to suggest that 
educational measurement is concerned with an array of 
independent entities, for each of which there are distinc- 
tive outward and visible signs that can be detected by an 
appropriate test.

An examination of the tests themselves serves to disabuse
us of this comfortable notion. On inspection it becomes
apparent that the contents of the tests are not nearly
so varied as their titles might lead us to expect. If
particular items are encountered outside their context,
it is often difficult to assign them to their appropriate 
category—to decide, for example, whether they have 
been taken from a test of intelligence, attainment, aptitude 
or possibly of interests. (The authors some time ago devised 
a parlour-game along these lines which proved to be 
challenging enough to afford entertainment to groups of 
experienced teachers.) 

This situation has not arisen because test constructors 
are clumsily inefficient and thus fall into the error of 
placing in one test items that properly belong to another. 
It is determined, rather, by the nature of the relation- 
ships among the phenomena with which educational and 
psychological measurements have to deal. 

Perhaps the first step towards an understanding of the 
problems that have to be solved if educational measure- 
ment is to be effective is to recognise that the objects 
and events with which it is primarily concerned cannot 
be directly observed. A teacher cannot stand back, at the 
end of a lesson or series of lessons, and admire his handi- 
work in the way that a sculptor may contemplate the 
changes he has made in the shape of his material. Many of 
the modifications that a teacher seeks to introduce— 
changes in his pupils’ insights, capabilities, attitudes and 
interests—are not open to direct inspection. 

Nevertheless, although we cannot directly observe the 
changes that result from the processes of maturation and 
learning, we are often justified in inferring that they have 
taken place. This is because we can note modifications 
in a person’s behaviour or performance. These latter mani- 
festations—the things that people say and do—are the 
objects and events with which educational measurement 
deals. We use people's performance—and particularly the 
ways in which they employ pens and pencils within contrived
situations—as evidence on which we base judgments
and predictions of various kinds.

This latter point deserves to be emphasised. We are 
rarely concerned about the performance per se, but rather
with what it reveals to us about the individual concerned.
Setting a test or examination is a means to an end—the 
collection of evidence which will enable us to estimate 
a person’s qualities and capabilities and, in the majority 
of instances, our purpose is to make a prediction. We 
are usually seeking grounds on which to base a forecast 
about the way in which a person is likely to respond in 
the future in a foreseeable set of circumstances. 

The word ‘performance’, of course, is a global term 
comprising all the overt activities and responses of which 
an individual is capable. In devising an educational test 
we confine our attention to some limited range of responses 
which will enable us to form the judgment or make the 
prediction that we require. This involves us in some 
system of classification. We need to assign each response 
that might be elicited to a class or category. And, as is
true of any system of classification, the categories we 
choose will be those that best serve our particular pur- 
poses. It is also convenient to label the categories. When 
we refer to ‘intelligence’, abilities of various kinds, apti- 
tudes, attitudes and so on, we are in fact referring to 
categories of responses or performances that we have 
seen fit to establish—and not, as we emphasised earlier,
to recognisable mental entities. 

It may be helpful and perhaps instructive to compare 
the process of classifying performances with that of 
organising an office filing-system. Stacks of unsorted
letters are unmanageable and the interests of order and 
efficiency are served if related letters—those dealing with 
a particular topic for example—are segregated into clearly 
labelled boxes or folders. As anyone who has worked in an 
office realises, however, there is no single filing system
which adequately serves every purpose. In one establish- 
ment it may be sensible and convenient to file all letters 
from abroad in one folder. In another, mainly concerned 
perhaps with exports, such a simple procedure would 
prove unworkable and it would be preferable to have
at least one separate file for each country from which the
office had correspondence. Thus it is possible, and sometimes
necessary, to organise two or more different filing
systems even when dealing with the same kind of material.
It may also be necessary to arrange for cross-references 
(that is, arranging for copies of the same letter to appear 
in several different folders). It is equally essential, if 
frustration is to be avoided, that whatever filing systems are 
employed should have some logical justification and 
should be comprehensible to those who have to use them. 
All these considerations are recognisably applicable to 
the classification of performances. Clearly it is possible— 
and may be desirable—to adopt different classifications 
for different purposes. And whatever system we adopt 
something akin to cross-reference will probably be essen- 
tial. (We have already noted, as evidence of this, that 
comparable test items tend to be found in tests designed 
for markedly different purposes.) Finally it is desirable
that any system of classification should have some logically
defensible basis.


The structure of abilities 


We used to be content with an arbitrary classification of 
human activities, based on uncontrolled observations. The 
procedure was to contemplate human behaviour from 
some convenient vantage point—an arm-chair for
example—to distinguish categories of performance and 
to attribute each of these to the operation of a separate 
mental power or ‘faculty’. This approach fell into dis- 
repute towards the end of the last century and was 
superseded by the use of more scientific methods, designed 
to yield objective evidence concerning the relationships 
obtaining among different kinds of performance. The tools 
of this new trade were tests, which provided the means 
for assessing performance quantitatively, and various statis- 
tical devices, notably those of correlation and factor 
analysis, which made it possible to observe the extent to
which performances are interrelated. 

At the end of this chapter we indicate sources of information
about these techniques and the results they have
yielded. All that we propose to attempt here is to explain
the rationale of the methods currently employed to deter- 
mine the structure of human abilities. 

Let us suppose that, after observing some aspects of 
human performance, we believe that we have detected six 
distinctive categories of performance which we label A, 
B, C, D, E and F. We then proceed to devise suitable 
tests for each of these categories so that we can obtain 
quantitative assessments of the performances of groups 
of people in each of them. Having thus obtained six sets 
of scores we then calculate the correlations between each 
set and the other five. If we were to find that performances 
A, B, C were highly intercorrelated, that D, E and F also 
showed high correlations among themselves, but that every 
member of the first group showed low correlations with 
every member of the second, what conclusions would we 
draw? Clearly we should be justified in deciding that we 
were confronted not with six distinguishable categories
of performance but with only two. Tests A, B, and C 
are all concerned with one of these, and tests D, E and F 
with the other. This, roughly, is the object of the exercise 
involved in applying factor analysis to test scores. We 
are looking for clusters of related performances. And 
each of these recognisable clusters—which are objectively 
determined categories—we designate as an ability. 
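
The argument can be illustrated with invented figures. The sketch below, in Python, assumes six hypothetical tests whose scores each depend on one of two unrelated traits plus some error; the traits, the size of the error and the group of 500 people are all assumptions made for the illustration. It computes only the fifteen correlations described above, which are the raw material of factor analysis and not factor analysis itself.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: 500 people and two unrelated underlying traits.
rng = random.Random(0)
n_people = 500
trait_1 = [rng.gauss(0, 1) for _ in range(n_people)]
trait_2 = [rng.gauss(0, 1) for _ in range(n_people)]

def noisy(trait):
    """A test score is taken to be the trait plus test-specific error."""
    return [t + rng.gauss(0, 0.5) for t in trait]

scores = {
    "A": noisy(trait_1), "B": noisy(trait_1), "C": noisy(trait_1),
    "D": noisy(trait_2), "E": noisy(trait_2), "F": noisy(trait_2),
}

# Correlate each set of scores with each of the other five.
correlations = {
    (p, q): pearson(scores[p], scores[q])
    for p, q in combinations(scores, 2)
}

for (p, q), r in correlations.items():
    print(f"{p}-{q}: {r:+.2f}")
```

With data constructed in this way the correlations within A, B, C and within D, E, F come out high, while those between the two groups stay close to zero. This is the pattern that would lead us to speak of two abilities and not six.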

The large-scale application of this approach has enabled 
us to identify a range of abilities and to recognise the 
relationships that they exhibit. A system of classification 
that is now widely adopted as being consonant with the 
evidence afforded by factor-analytical studies involves a 
hierarchical structure of abilities. Near the beginning of 
the century Professor Charles Spearman was the first to 
note the tendency for the scores of all tests involving intel- 
lectual activity to be positively correlated. This led him 
to postulate that there must be a general factor, which 
he labelled ‘g’, entering into all such activities. Since
then it has been found that there are, in addition to the 
general factor, ‘group’ factors which enter into some 
groups of tests but not others. We have identified for 
example a broad group of performances labelled ‘v:ed’
or verbal-educational skills, which comprise verbal 
ability (all performances involving the use and compre- 
hension of verbal symbols) and numerical ability (perform- 
ances involving the use and comprehension of numerical 
symbols). Another distinctive category involves the ‘k:m’ 
group. The symbol ‘k’ is used to designate spatial ability— 
the ability to comprehend spatial relationships—and ‘m’ 
indicates the ability to understand mechanical relation- 
ships. 

It is usual to represent this system of classification, in
order to bring out its hierarchical structure, in the form of
a family tree (Fig. XX).


general factor (g)
    verbal-educational group (v:ed)
        verbal (v)
        numerical (n)
    spatial-mechanical group (k:m)
        spatial (k)
        mechanical (m)


Fig. XX 


It is possible, if performances are analysed closely (that 
is, if many different kinds of test are used) to divide each 
broad group factor into minor factors and to continue 
the sub-division until one discovers performances that 
are specific to one particular test or task. 
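
For readers who find a worked form helpful, the family tree can be written down as a nested structure and searched. The sketch below is only an illustration: the factor labels are those used in the text, but the mapping `HIERARCHY` and the function `factors_involved` are hypothetical names introduced here. The function returns the chain of factors, from the general factor downwards, that lies above a given ability.

```python
# The hierarchy of Fig. XX written as a nested mapping (illustrative only).
HIERARCHY = {
    "g": {
        "v:ed": {"v": {}, "n": {}},
        "k:m": {"k": {}, "m": {}},
    }
}

def factors_involved(target, tree=HIERARCHY, path=()):
    """Return the chain of factors from the general factor down to `target`.

    An empty tuple is returned when the label is not in the tree.
    """
    for name, children in tree.items():
        chain = path + (name,)
        if name == target:
            return chain
        found = factors_involved(target, children, chain)
        if found:
            return found
    return ()

print(factors_involved("v"))
print(factors_involved("k"))
```

Every chain begins with ‘g’, which reflects the position of the general factor at the head of the tree.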

We recognise that the above shorthand note, covering 
as it does a complex topic, will not be immediately mean- 
ingful. The references at the end of the chapter will enable 
the reader to explore further in this field and to discover 
for himself the significance of these categories and the 
ways in which they have been established. Our major 
concern at this stage is to try to indicate what is involved
in tests of abilities. They are tests which sample objectively
determined categories of performance. For example,
group tests of verbal reasoning or ‘intelligence’ as they are 
commonly called, are concerned with the broad range 
of skills that are shown on the left side of the diagram
above—involving, that is, the ‘g’ and ‘v:ed’ factors. This 
area has been selected (and the spatial-mechanical group 
of abilities excluded) because it contains the skills that 
are of especial importance in scholastic work: thus a 
test which adequately samples these skills provides a 
serviceable prediction of the extent to which an individual 
is likely to succeed in an academic course. Other tests of 
ability are aimed at more restricted areas of performance: 
tests of verbal ability, spatial ability, etc. Tests of ability 
may also be devised to measure fairly narrow sub-divisions 
of these broad group factors. We may, for example, aim 
to measure one particular aspect of verbal ability—verbal 
fluency for example. It should be pointed out, however, 
that because the structure of abilities assumes the hierarch- 
ical character that we have indicated one does not obtain 
‘pure’ measures of any particular ability. Every test will 
to some extent measure ‘g’ as well as whatever individual 
ability it is concerned with. Perhaps the best way to under- 
stand this is by analogy with the situation we commonly 
encounter in school examinations. In discussing essay- 
type examinations earlier we made the point that all 
such examinations inevitably assess performance in English 
to some extent. Thus, a history examination samples not 
only attainments in that subject but also the extent to 
which an individual is able to comprehend and utilise the 
English language. Similarly, ‘g’ or general ability inevitably 
enters into all the cognitive tests that we devise no matter 
to what area of performance we are addressing our atten- 
tion. 


Attainments and aptitudes 


We have seen that abilities are distinguished by classifying
performances according to the extent to which they are
manifestly related. Attainments consist of somewhat
different and mainly traditional categories: in arriving
at these we are still classifying performances—that is,
what individuals say and do—and we are still concerned
with the relationships exhibited by these performances,
but we adopt different criteria in carrying out the classification.
In classifying performances into abilities we
emphasise the kinship of the performances themselves. In
classifying them into attainments we place the emphasis on
the objects or phenomena with which these performances
are concerned. For example, we distinguish between attainments
in foreign languages—French, German, Spanish,
etc.—not in terms of the skills involved (these are
obviously much the same in each case), but only with
respect to the nature of the material on which the
skills operate, so to speak. Attainments are largely conventional
categories, which can be combined or sub-divided
arbitrarily: we may elect, for example, to treat
chemistry as a ‘subject’ or we may distinguish between
organic and inorganic chemistry; similarly, we may
choose to combine history, geography, and perhaps
branches of other subjects to provide the broader category
of ‘social studies’.

The point that we wish to emphasise is that the distinction
between abilities and attainments arises simply from the
application to the same objects and events of two different
systems of classification. Basically, therefore, in
testing abilities and attainments we are in effect sampling
the same population of acquired skills. By recognising this
we can avoid a good deal of the confusion that has arisen
in the past concerning the relationship between the two.
It has sometimes been assumed that in measuring ability
we are assessing ‘capacity’ or ‘potential’, whereas in
measuring attainment we are estimating the extent to
which this capacity or potential has in fact been realised.
The truth is that in each instance we are examining a sample
of performance—that is, we are discovering what, at
a given moment, an individual says or does when confronted
with a particular task or problem.

We are entitled, of course, to use this evidence to make
predictions. We may say that because, here and now, a 
person demonstrates that he can successfully perform 
tasks X and Y, we are reasonably confident that when he 
encounters tasks X and Y in the future he will acquit 
himself creditably. We may even go further and forecast 
that he will also prove capable of tackling A, B and C
successfully on the grounds that we have observed a 
tendency for people who can tackle X and Y to be able 
also to cope with A, B and C. In other words, we can 
use our test as one of capacity or potential, although in 
effect it can properly be described only as one of present 
performance. 

Similarly, we may use our test of present performance 
to make retrospective judgments. Because an individual 
shows now that he can perform X and Y we are entitled 
to infer that he must have practised these skills assid- 
uously or that he has been competently taught. 

A test of aptitude is one designed to provide an indica- 
tion of the extent to which an individual is likely to 
succeed when he tackles some subject, course, or activity 
that he has not previously encountered. From the fore- 
going discussion it is clear that such a test does not 
involve objects and events different from those that we 
consider when we seek estimates of ability or attainment.
In devising an aptitude test we have to determine the
kinds of present performance that are likely to enable us 
to forecast future success in the activity under review. 
We may decide that particular abilities are likely to be 
called into play in the subject or course we are considering 
or that certain aspects of attainment constitute a relevant
consideration. 

What we are suggesting, in fact, may be summed up as 
follows:


(1) All that our educational tests measure—and, indeed, 
all that they can measure—is present performance: what 
an individual says and does here and now when confronted 
with particular tasks.


(2) Whatever area of performance we select for scrutiny
can of course only be sampled: we cannot devise a test 
comprehensive enough exhaustively to assess proficiency 
in even the narrowest area of performance. Thus, our 
measures enable us to draw inferences about the kinds 
of performance in which we are interested and never 
furnish conclusive evidence about an individual’s capabili- 
ties. (If this is not immediately clear, consider the practice 
of measuring proficiency in simple arithmetic. The number 
of addition, subtraction, etc., problems that could be 
devised is far too large to include in any single test.
What we do is to rely on a small sample of the total 
population of possible items and to assume that if a 
person can deal successfully with these he would be
likely also to deal successfully with other comparable
items.) 


(3) The content of a test is determined by the system of
classification that we apply to performances which, in
turn, depends on the purposes that the test is designed
to serve. Thus, to qualify for inclusion within a test of
ability—spatial, verbal, numerical, etc., performances may
be drawn from both ability and attainment categories: the
characteristic that they must share is that they furnish
evidence on which an appropriate prediction may be
based.
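
The arithmetic example given under point (2) can be put into figures. The sketch below rests on assumptions of our own: the population of items is taken to be every addition of two whole numbers from 1 to 99, the test contains 40 items drawn at random, and the pupil is an imaginary one who succeeds whenever the units column does not require a carry. None of these details comes from an actual test.

```python
import random

# Population of possible items: every addition a + b with 1 <= a, b <= 99.
population = [(a, b) for a in range(1, 100) for b in range(1, 100)]

def pupil_answers_correctly(item):
    """Hypothetical pupil: copes with a sum unless the units column needs a carry."""
    a, b = item
    return (a % 10) + (b % 10) < 10

# Proficiency over the whole population (never observable in practice).
true_rate = sum(map(pupil_answers_correctly, population)) / len(population)

# What a test actually does: draw a small sample of items and score it.
rng = random.Random(1)
test_items = rng.sample(population, 40)
observed_rate = sum(map(pupil_answers_correctly, test_items)) / len(test_items)

print(len(population), "possible items;", len(test_items), "in the test")
print(f"proportion correct on the test: {observed_rate:.2f}")
print(f"proportion over all items:      {true_rate:.2f}")
```

The proportion obtained from the 40 items is an estimate of the proportion over all 9,801, and it would vary from one sample of items to another. This is the sense in which a test supports an inference and never furnishes conclusive evidence.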


We can now resolve the apparent paradox to which we
referred at the beginning of this chapter. We noted that
tests designed to serve different purposes often employ
comparable items and that it is often difficult, when a
particular item is encountered outside its context, to decide
whether it properly belongs to a test of ability, attainment
or aptitude. Although we hope that the reason for
this has emerged from the foregoing discussion, perhaps
the point can be further clarified by an example. Suppose
that we examine the contents of four tests: a test of
‘intelligence’ or verbal reasoning; a test of numerical
ability; a test of attainment in elementary mathematics;
a test of aptitude for an advanced course in mathematics.
The first test must sample that broad range of skills that
are contained within the g and v:ed categories as estab- 
lished by the results of factor-analytic studies. The second 
involves a more restricted range of performances but since 
these constitute one component of the v:ed group, the 
two tests must inevitably overlap and must therefore 
utilise comparable items. We are concerned in the third 
test to estimate the extent to which an individual has 
developed some of the skills sampled by the first two 
tests and their application to the curriculum of an elemen- 
tary mathematics course. Again it is feasible that some 
of the items common to the first two tests will adequately 
serve the purposes of the third. Finally we may well 
decide that the prediction of success in an advanced 
mathematics course requires evidence of (a) general
scholastic ability, (b) the ability to comprehend and mani- 
pulate numerical symbols, and (c) a satisfactory level of 
attainment in an elementary mathematics course. The 
aptitude test therefore would bear some resemblance to 
each of the other three tests and, conceivably, some items 
might well prove to be serviceable in all four tests. 


Interests and attitudes 


The performances that we have discussed so far—those that 
may be subsumed under such headings as ability, attain- 
ment and aptitude—represent only a small fraction of the 
total array of activities of which human beings are capable. 
Furthermore, the processes of learning and problem-solving, 
with which the kinds of tests that we have been considering
are concerned, do not occur in isolation. They are preceded
and accompanied by, and in turn serve to modify,
the total personality of the individuals concerned. Our
attempts to provide effective educational guidance would 
be fruitless, therefore, if we confined our attention to our 
pupils’ intelligence, attainments, aptitudes and the like
and failed to take account of the motives that prompt 
their activities and the emotions that these activities 
engender. An adequate discussion of the complex problems
of personality assessment cannot of course be undertaken
in a short introductory text such as this and we propose
to refer, briefly, to only two kinds of test in this field
—those concerned with interests and attitudes. 

Teachers must obviously concern themselves with the 
nature and range of their pupils’ interests. The strength 
of a child’s interest in a school subject, for example, is an
item of evidence which will help us both to judge the effec- 
tiveness of a course of instruction and also to predict the 
levels of attainment that the child is likely to reach in 
the future. By an interest we mean a tendency or disposi- 
tion to pay attention to particular phenomena or to select 
a given activity when a choice is available. Again we can 
only infer that such a disposition exists by observing per- 
formance. We may confront children with a choice of 
activities on a number of occasions and note the extent to 
which they consistently make the same selection. Or we 
may arrange for a record to be kept of their spare-time
activities, noting the frequency with which they spon- 
taneously elect to indulge in a particular pursuit. 
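
A record of this kind lends itself to a simple count. The sketch below assumes an invented record of one child's free choices on ten occasions; the activities and the function name `most_consistent_choice` are hypothetical. It reports the activity chosen most often and the share of occasions on which it was chosen, which is one crude index of consistency among many that could be used.

```python
from collections import Counter

def most_consistent_choice(choices):
    """Return the activity chosen most often and the share of occasions it was chosen."""
    counts = Counter(choices)
    activity, times = counts.most_common(1)[0]
    return activity, times / len(choices)

# Hypothetical record: one child's free choice on ten separate occasions.
record = ["reading", "model-making", "reading", "reading", "football",
          "reading", "reading", "model-making", "reading", "reading"]

activity, share = most_consistent_choice(record)
print(activity, share)
```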

Tests of interest often bear a resemblance to tests of 
attainment in that they investigate the extent to which 
children have acquired particular kinds of knowledge and/ 
or information. Clearly knowledge and interest are reci- 
procally related in that a strong interest in a topic will 
induce a person to acquire knowledge about it and this 
additional knowledge will serve further to stimulate his 
interest. If a test is designed to test knowledge that has 
been gleaned outside the classroom it might well provide 
even more reliable evidence of the existence of a strong in- 
terest. 

Attitudes are akin to interests in that they refer to ten- 
dencies or dispositions to react in a particular way towards 
certain phenomena or aspects of the environment. They 
indicate, however, a somewhat broader concept. An atti- 
tude is a complex of cognitive and emotional dispositions— 
the tendency to hold certain beliefs about and to feel in 
certain ways towards objects, persons or ideas. Our educational
aims include the attempt to foster particular attitudes,
not only towards the activities with which we are
directly concerned (favourable attitudes towards academic 
pursuits, etc.), but also towards aspects of the world at 
large (tolerant attitudes towards racial minorities for ex- 
ample). 

We are unlikely to be provided with an opportunity 
to observe directly the behavioural tendencies that reveal 
many of the attitudes in which we are interested and we 
usually have to rely on an individual's verbal reports which 
help us to draw inferences about the attitudes that he has 
developed. Attitude ‘scales’ or tests tend to consist of a
series of statements, each representing some point on a 
continuum from, for example, an extremely favourable to 
an extremely unfavourable attitude towards some phen- 
omenon. An individual is then invited to indicate the state- 
ments to which he can subscribe and his position on the 
scale can be calculated. A scale designed to assess attitudes 
towards mathematics might include such statements as:
I sometimes enjoy mathematics but on the whole prefer 
other school subjects; I would be extremely happy if I were 
told that I would never have to do mathematics again; I am 
never happier than when I am working on a mathematical 
problem; I feel relatively indifferent towards mathematics 
—I neither like nor dislike it very strongly. A child's
endorsement of such statements (if we could be satisfied 
that he was being truthful) would help us to assess his 
attitude in this respect. 
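
How a position on such a scale might be calculated can be shown with a small sketch. It assumes one common convention, namely that each statement carries a scale value and that a pupil's position is the median of the values of the statements he endorses. The scale values below (1 for extremely unfavourable, 11 for extremely favourable) are invented for the illustration; in an established scale they would have been determined empirically.

```python
from statistics import median

# Statements from the text, each paired with a hypothetical scale value.
STATEMENTS = {
    "never_again": ("I would be extremely happy if I were told that I would "
                    "never have to do mathematics again", 1.0),
    "prefer_others": ("I sometimes enjoy mathematics but on the whole prefer "
                      "other school subjects", 4.5),
    "indifferent": ("I feel relatively indifferent towards mathematics", 6.0),
    "never_happier": ("I am never happier than when I am working on a "
                      "mathematical problem", 10.5),
}

def scale_position(endorsed):
    """Median scale value of the statements a pupil endorses."""
    return median(STATEMENTS[key][1] for key in endorsed)

# A pupil who endorses the two middling statements.
print(scale_position(["prefer_others", "indifferent"]))
```

The calculation is only as trustworthy as the endorsements themselves, which is why the proviso about truthfulness matters.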

Teachers are unlikely to use attitude scales as such,
but a familiarity with established scales would help them 
to determine the kind of evidence that is relevant to the 
assessment of particular attitudes. They could then look 
for and note this evidence when observing their pupils’ behaviour,
listening to their discussions or reading their
essays. 

This latter approach might be valuable to a teacher not 
only as a means of improving the quality of the assessments 
that he may wish to make of his pupils’ personality traits 
but also in devising tests of attainments, aptitudes, etc. We 
have seen that the key problem in educational measurement
is that of determining the kind of evidence that is
relevant for our purposes—to enable us to form a judgment 
or make a prediction. 

The study of published tests can help us to discover the 
kinds of performance that can be appropriately sampled
for a particular purpose. This is particularly true if the 
test we examine furnishes satisfactory indications of its 
validity. In other words if a test demonstrably serves a 
given purpose—for example it has been found to afford a 
dependable prediction of success in a specified course—it 
will provide us with a useful guide when we come to de- 
vise a test of our own for a comparable purpose. 




8 


The problem of moderation 


School-based examinations 


We have concentrated so far on the application of the prin- 
ciples and techniques of educational measurement to the 
design and use of internal examinations. We suggested at 
the outset, however, that in the future teachers might be- 
come involved in examining on a broader front. In some 
areas provision is already made for the award of the Certi- 
ficate of Secondary Education to be based on examinations 
that are devised and assessed by teachers within their own
schools. If this proves to be a viable procedure it is con- 
ceivable that other externally administered examinations 
might eventually become similarly school-based. Such 
arrangements call for a process of ‘moderation’. It is with 
the problems involved in moderation and ways in which 
these might be solved that this chapter is concerned.

The essence of the problem can perhaps be most readily 
appreciated by envisaging the outcome of a series of school- 
based examinations. Let us suppose that it has been agreed 
to award one of five grades to each candidate. Grade ‘A’ is 
to be awarded to those who demonstrate an unusual degree 
of merit; ‘B’ to those whose performance is manifestly 
above average; ‘C’ to the average run of examinees; 
‘D’ to those who perform somewhat below this level 
and ‘E’ to those who barely satisfy the minimal requirements.

If without further ado we were to leave the teachers
in each school to determine the appropriate grades for
their pupils it is scarcely feasible that these grades would
prove to be comparable over the country as a whole or
within a limited area.

It is the task of a moderator* to take steps to ensure
that the eventual results are, as far as possible, fair to
all concerned. To achieve this end he must distinguish,
when confronted with variations from one school to another
in the ways in which the grades are distributed,
between those variations which may be ascribed to a difference
in the competence of the pupils and those which may
be due to the idiosyncrasies of their examiners. For example,
one school may furnish results which indicate a
preponderance of ‘A’ and ‘B’ grades, whereas another
awards average and below-average grades to a high
proportion of its pupils. The question to be answered is
whether this is a true reflection of the relative levels of
performance of the two sets of pupils, or whether in the
former school the teachers have been unduly lenient and
those in the latter excessively severe in making their judgments.
Alternatively one could conceive of a situation
in which two schools furnish distributions which are
identical—the same proportions of pupils in each school
being awarded a particular grade—and which conform to
the general distribution for the area as a whole. It may well
be, however, that one of these schools accommodates an
unusually effective group of pupils whose superior achievements
are not adequately reflected in their teachers’ assessments.
Thus it is the duty of the moderator to satisfy
himself that the grades finally awarded adequately match
the levels of achievement of the pupils concerned.
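
The two situations just described can be set out in figures. The sketch below uses invented returns from three imaginary schools, and the function name `grade_distribution` is our own. It does no more than tabulate the proportion of candidates awarded each grade, which is all that the raw returns can tell a moderator.

```python
from collections import Counter

GRADES = "ABCDE"

def grade_distribution(grades):
    """Proportion of a school's candidates awarded each grade."""
    counts = Counter(grades)
    return {g: counts[g] / len(grades) for g in GRADES}

# Hypothetical returns from three schools, one letter per candidate.
school_1 = list("AAAABBBBBBCCCCDE")   # preponderance of A and B grades
school_2 = list("ABCCCCDDDDDDEEEE")   # mostly average and below-average grades
school_3 = list("AAAABBBBBBCCCCDE")   # identical in shape to school 1

for name, grades in [("1", school_1), ("2", school_2), ("3", school_3)]:
    dist = grade_distribution(grades)
    print("school", name, {g: round(p, 2) for g, p in dist.items()})
```

Nothing in these proportions shows whether the pupils of school 1 are abler than those of school 2 or merely more leniently assessed, nor whether the identical returns of schools 1 and 3 conceal a real difference in achievement. Those questions require evidence from outside the grades themselves.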

Although the problem can be most easily appreciated
by envisaging the outcome of the examinations, it does not
follow that moderation is a wholly retrospective exercise.
To be effective a moderator must be involved from the outset.
He needs to concern himself not only with the final
results but with the ways in which the courses, to which
these results are related, are initially planned.

* We refer to ‘a moderator’ for the sake of convenience. In
practice, the functions we are discussing would be the responsibility
of groups or panels of teachers.


For example it is conceivable and, indeed, probable, that
teachers will disagree about the purposes that should be 
served by the courses they offer. Furthermore this dis- 
agreement may be justifiable in that individual differences 
among pupils may demand that there should be a variety of 
aims even within the same subject areas. Teaching mathe- 
matics, for instance, may be conceived in one school as a 
part of the general education of pupils who are likely 
to make only limited applications of their knowledge in 
their everyday lives, whereas, in another, it may be re- 
garded as part of a necessary vocational preparation for 
pupils who are intending to become scientists or technolo- 
gists. These two courses are likely to differ markedly in 
content and at the extremes one child may have been in- 
vited to master little more than the conventional arith- 
metical operations and their application to a series of every- 
day situations; another may have been introduced to the 
calculus and the theory of probability.

If the appropriate examinations are set to assess attainments
in each of them the resultant grades cannot be compared
in any acceptable sense.

Nor is this a situation which can be remedied by a tidying-up
process after the results have been determined. The
problem here is not one of being confronted with two different
scales of marks but rather that of dealing with
what are, in effect, examinations in wholly different subjects
even though they share the same general title. This
aspect of moderation, involving a decision about the comparability
of courses as distinct from grades, is entirely
a question of judgment which can be aided little if at all 
by statistical devices. r 

In many instances, of course, experienced teachers will
be able to reach agreement about the acceptability of different
courses. These may differ in content and distribution of
emphasis but they may be regarded as being of equal
academic merit. This being so we may turn to the next
responsibility that a moderator must assume. Clearly he
must concern himself with the design and conduct of the
examinations themselves. He must use his influence to
ensure that the processes of assessment are carried out
with maximal efficiency.

We need not dwell on this aspect of the moderator’s
function since we have outlined in previous chapters the ways
in which the validity and reliability of examinations may
be improved. Validity largely depends on the correspondence
between the examination and the objectives of the
course and on an adequate sampling of its contents in terms
of the specific skills and items of knowledge that are
intended to be the end product of the instruction offered.
Reliability is best ensured by empirical means: by testing out
the items or questions in the examination and feeding
back the information so obtained to those responsible for
drafting future papers.

Assuming that care has been taken in these ways to 
ensure that the examinations are as efficient as possible we 
must now turn to what is generally regarded as the moder- 
ator’s main function—that of making sense of the marks 
or grades that are eventually supplied. 


The process of moderation 


Moderation in this sense is no new phenomenon. It has
been a feature of examinations of various kinds and at
different levels for many years. For example, a form of
moderation has been practised regularly by some local
education authorities with respect to the eleven-plus
examination. Most education authorities wished to make use,
as part of their procedures for determining the suitability
of candidates for different kinds of secondary education,
of the fact that primary school teachers were clearly well
informed about the relative merits of their pupils. It
offended against common sense to base assessments of children’s
promise and attainments on performance in a single
examination. Clearly a more valid judgment would be obtained
if the teacher’s estimates of each child’s achievements and
capacities could also be obtained. The obvious snag, of
course, was that whereas a teacher could readily and
effectively arrange his own pupils in rank order with respect
to their suitability for a grammar school course he could
not be expected to compare them in this or any other
respect with children whom he had not encountered. Thus the
authorities were faced with a seemingly insuperable
obstacle. They could avail themselves of expert judgments
concerning the relative merit of children within a
particular school but there appeared to be no means of comparing
children from different schools in their area, except of
course on the basis of their performance in the authority’s
examination. A simple device, that of scaling, afforded
a solution to this problem and enabled authorities to make
effective use of teachers’ judgments in the eleven-plus
procedure. This involved the use of a common measuring
instrument (usually a verbal ability test) to enable the
authority to make due allowance for the differences between
schools in terms of scholastic skills of the pupils attending
them. By securing an objective assessment of the level of
these schools in a given area they were able to adjust or
‘moderate’ the teachers’ judgments accordingly. (A detailed
account of the ways in which this scaling procedure was
applied and of the results it achieved is provided in the
present authors’ book Admission to Grammar Schools.)
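
As a rough illustration of the scaling idea described above, the
following Python sketch rescales one school's teacher estimates so
that they take on the mean and spread of the same pupils' scores on
the common test, while leaving the teacher's rank order untouched.
The linear method, the function name scale_estimates and the sample
figures are assumptions made for illustration only; the text does
not state the formula that the authorities actually used.

```python
from statistics import mean, pstdev


def scale_estimates(teacher_marks, test_scores):
    """Linearly rescale teacher marks to the mean and spread of the test scores.

    Assumption: both lists refer to the same group of pupils in one school.
    Only the distribution of test_scores is used, not each pupil's own score,
    so the teacher's order of merit is preserved.
    """
    if len(teacher_marks) != len(test_scores):
        raise ValueError("both lists must cover the same pupils")
    t_mean, t_sd = mean(teacher_marks), pstdev(teacher_marks)
    s_mean, s_sd = mean(test_scores), pstdev(test_scores)
    if t_sd == 0:
        # The teacher gave every pupil the same mark: nothing to spread out.
        return [s_mean for _ in teacher_marks]
    return [s_mean + (m - t_mean) * s_sd / t_sd for m in teacher_marks]


# Hypothetical figures: a teacher's estimates and the school's test scores.
teacher_marks = [40, 50, 60, 70, 80]
test_scores = [95, 100, 105, 110, 115]
scaled = scale_estimates(teacher_marks, test_scores)
```

After scaling, estimates from different schools are expressed on the
scale of the common test and can be set side by side, which is the
purpose the text ascribes to the procedure.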

One form of moderation is closely similar to the method
of scaling we have just discussed. This involves the use for
scaling or moderating purposes of a common measuring
instrument. This may be of a general nature, such as a test
of scholastic aptitude, or it may consist of ‘Core Papers’ as
they are sometimes called. For example, suppose that we
are considering the moderation of C.S.E. examinations in
geography in a situation in which each school has been left
free to devise its own syllabus and to design its own
examination papers in order to assess the performance of the
pupils at the end of the course. Thus the schools concerned
may choose different regions of the world for special
study and, in other respects too, they may differ: in, for
example, the ways in which they select particular topics
for detailed treatment. Nevertheless, they may be open
to persuasion that there are certain essential features of
geographical study (some aspects of physical geography
or the interpretation of maps, for example) which every
pupil in a secondary school course should encounter. If so,
it should be possible to draft a ‘Core Paper’, an examination
of the common essentials or agreed syllabus, so to
speak, which could be set across all schools. Such an
examination could serve the same purpose as the ability test
in the illustration we have discussed. That is, it could be
employed as a scaling instrument to adjust each school’s
marks or grades, awarded as a result of their own
examinations.


Moderation without a scaling instrument 


It may not be possible or not be thought desirable to intro- 
duce core papers of the kind suggested. The alternative is 
for a panel of moderators to scrutinise each school’s assess- 
ments and to adjust them where necessary as a result of 
their own investigation of the pupils’ examination scripts. 
The principle here is that of safety in numbers. The more 
people who agree about the grade to be awarded to a parti- 
cular performance the more valid the assessment is likely 
to be, especially if the moderators involved in this exer- 
cise are chosen because they are demonstrably experienced 
and trustworthy examiners. 

The kinds of fault that moderators might need to remedy 
in this procedure fall under three main heads.


[...] and accordingly mark them differently. Yet [...]
that are over-refined and [...] or penalise candidates
for what other examiners would be inclined to regard as
trivial differences.

Conformity. The third major difference that examiners 
might betray is disagreement about the relative merits of 
the candidates. Two examiners might adopt broadly similar 
standards and employ equivalent degrees of discrimina- 
tion but might nevertheless be disposed to place the same 
group of pupils in somewhat different rank orders. This 
could occur if, for example, they disagreed about the im- 
portance of various aspects of the curriculum and were 
disposed to react differently to the manifestation of particu- 
lar kinds of skill or knowledge. 


In scrutinising the assessment furnished by a particular 
school moderators would need to pay attention therefore 
to these three attributes: the standard of marking; the 
degree of discrimination; and the extent to which the
examiners have conformed to what the moderators regard
as an appropriate order of merit.

In each instance the judgment of the moderators
furnishes the criterion against which the adequacy of a school’s
grades is assessed and for this reason it is clear that the 
choice and, where necessary, the training of suitable 
moderators are crucial to the success of the enterprise. 

Standards of examining may be compared by relating
the mean mark or grade supplied by the school to that
determined independently by the moderators. Some
differences are of course to be expected, since each mean is,
in effect, only a sample of the means that would be derived
if the assessments were to be repeated on a number of
occasions. The critical decision that a moderator has to make in
this respect concerns the size of the difference that may
be tolerated. To arrive at this decision he needs to apply
techniques, to which we have already referred, for
determining the standard error of the observed difference.
In other words he must be able to identify the kind of
discrepancy that is statistically significant. Differences in
discrimination will be revealed by differences in the spread
of the marks or grades. We have already discussed the use
of the standard deviation as a means of measuring the
spread of a set of marks and this, therefore (or some other
comparable measure), would enable a moderator to judge
the appropriateness of a school’s grades in this respect.
Finally, agreement in terms of rank order can be checked
by examining the correlation between the two sets of
marks or grades. A high level of correlation would indicate
a substantial measure of agreement. After making this
comparison a moderator must of course face a question
to which there is no clear-cut answer: how high is a high
correlation? All that one can say on this score is that if he
demanded a correlation in excess of 0.9 he would be, in
most instances, doomed to disappointment. On the other
hand, if he allowed a correlation of less than 0.6 to pass
muster, he would be accepting a considerable degree of
non-conformity.
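
The three checks just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
same scripts are taken to have been marked by both the school and
the moderators, so that the standard error is computed from the
paired differences; a difference of more than two standard errors
is treated as significant, which is a rough convention and not a
rule taken from the text; and the function name compare_marks and
the marks themselves are invented.

```python
from math import sqrt
from statistics import mean, stdev


def compare_marks(school, moderator):
    """Compare two sets of marks for the same scripts.

    Returns the mean difference (standards), the ratio of standard
    deviations (discrimination) and the product-moment correlation
    (conformity). Assumes neither set of marks is constant.
    """
    n = len(school)
    if n != len(moderator) or n < 3:
        raise ValueError("need two equally long lists of at least 3 marks")

    # Standards: mean difference and its standard error (paired scripts).
    diffs = [s - m for s, m in zip(school, moderator)]
    mean_diff = mean(diffs)
    se_diff = stdev(diffs) / sqrt(n)

    # Discrimination: compare the spread of the two sets of marks.
    sd_school, sd_mod = stdev(school), stdev(moderator)

    # Conformity: correlation between the two sets of marks.
    s_bar, m_bar = mean(school), mean(moderator)
    cov = sum((s - s_bar) * (m - m_bar) for s, m in zip(school, moderator)) / (n - 1)
    r = cov / (sd_school * sd_mod)

    return {
        "mean_diff": mean_diff,
        "se_diff": se_diff,
        "differs": abs(mean_diff) > 2 * se_diff,  # assumed rule of thumb
        "sd_ratio": sd_school / sd_mod,
        "correlation": r,
    }


# Hypothetical marks for eight scripts.
school = [52, 61, 48, 70, 66, 57, 45, 63]
moderator = [48, 58, 47, 64, 60, 55, 40, 59]
result = compare_marks(school, moderator)
```

In this invented case the school marks run consistently higher than
the moderators' although the two orders of merit agree closely,
which is the pattern of lenient standards with good conformity.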


A simplified procedure 


It will be readily appreciated that to carry out these tasks
the teachers concerned would find themselves involved
in fairly complex and unfamiliar computations. Not only
would they be required to calculate means, standard
deviations and correlations but also to make further
calculations in order to estimate the significance of any differences
that emerged.

Fortunately a somewhat simplified procedure has been
evolved for the conduct of local moderation, that is, the
moderation, by a panel of teachers, of the marks or grades
submitted by a small group of schools in the immediate
vicinity. This is outlined in the Schools Council’s
Examinations Bulletin No. 5. We shall concern ourselves only with
the general features of the proposed method; for
information concerning its detailed application the reader should
consult the Bulletin itself.


The object of this suggested procedure is to cut down
the work of the moderators in two ways: by reducing
the load of re-marking that they are required to undertake;
and by simplifying the statistical work involved.



The first of these economies is effected by relying on 
sampling. In ideal circumstances one would arrange for 
each script to be re-marked successively by several moder- 
ators. If a sizeable number of competent and experienced 
teachers were, independently, to mark all the papers, their 
pooled results would clearly yield a highly dependable 
assessment. Individual errors and differences would be
ironed out and the eventual mean score could be regarded
as not significantly different from the hypothetical ‘true’
mark or grade. Such a procedure, however, would be too
costly in time and labour to be practicable. And, indeed, 
it is demonstrably unnecessary if, instead of arranging for 
a group of moderators to mark all the scripts, each is 
required to mark a sample. The resultant means, it has been 
shown, do not differ significantly from those derived from 
an assessment of the total array of scripts. Needless to say, 
this is true only if the sampling is a strictly random one. 

This then is one of the ways in which the labour involved
in moderation can be cut down. In extensive trials it has 
been demonstrated that a team of moderators can arrive 
at satisfactory results by confining their attention to a 
sample of no more than twenty scripts. 
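
The requirement that the sample be strictly random can be met with
a few lines of Python. The sketch below is illustrative only: the
function name sample_scripts, the script numbering and the seed are
assumptions, and the figure of twenty is simply the sample size
mentioned above.

```python
import random


def sample_scripts(script_ids, k=20, seed=None):
    """Draw a simple random sample of k scripts without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable for checking
    ids = list(script_ids)
    if len(ids) <= k:
        return ids  # too few scripts to sample: moderate them all
    return rng.sample(ids, k)


# Hypothetical school entry of 150 scripts numbered 1 to 150.
sample = sample_scripts(range(1, 151), k=20, seed=1)
```

Drawing the numbers mechanically, and not letting the school or the
moderator pick the scripts, is what keeps the sample free of bias.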

Another saving is effected in this proposed scheme by 
using ‘range estimates’ instead of standard deviations. The 
range estimate is a measure of spread which is extensively 
used in certain statistical procedures applied to industrial 
processes but one which has never before been introduced 
into educational measurement. Again it has been shown
empirically that it yields a measure sufficiently accurate for the
purposes of moderation and involves nothing more than 
simple addition and subtraction as against the laborious 
and time-consuming processes (calculating sums of squares 
and square roots) that are involved in computing standard 
deviations. 
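
To give an idea of how a range-based measure of spread works, the
sketch below follows the practice of industrial quality control:
the marks are taken in small groups, the range of each group is
found by subtraction, and the mean range is divided by a tabulated
constant to estimate the standard deviation. This is an assumed
illustration and not necessarily the method set out in the
Bulletin; the function name range_estimate, the group size of five
and the marks are invented, while the constants are the standard
quality-control values for converting a mean range into a standard
deviation.

```python
from statistics import mean

# Standard quality-control constants for groups of 2 to 6 observations.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534}


def range_estimate(marks, group_size=5):
    """Estimate the standard deviation from the mean range of small groups.

    Assumes the marks are in an arbitrary (not sorted) order; any marks
    left over after the last complete group are ignored.
    """
    if group_size not in D2:
        raise ValueError("group_size must be between 2 and 6")
    groups = [
        marks[i:i + group_size]
        for i in range(0, len(marks) - group_size + 1, group_size)
    ]
    if not groups:
        raise ValueError("not enough marks for one complete group")
    mean_range = mean([max(g) - min(g) for g in groups])
    return mean_range / D2[group_size]


# Hypothetical marks, in the order in which the scripts were picked up.
marks = [52, 61, 48, 70, 66, 57, 45, 63, 59, 50]
spread = range_estimate(marks)
```

When two sets of marks are grouped in the same way, their mean
ranges can be compared directly, so that the division by the
constant can be omitted and only addition and subtraction remain.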

Finally a further substantial economy has been found 
possible as a result of experiments involving carefully 
chosen teams of moderators. In these experiments the 
moderators conducted trials designed to discover the
extent to which they could secure agreement amongst
themselves under the three headings we discussed earlier:
standards, discrimination and conformity.

Each moderator marked a sample of scripts and the 
results were then analysed with respect to these three attri- 
butes of their performance. Numerous trials of this kind re- 
vealed that disagreement was very rarely found under 
the latter two heads. Moderators thus tend to discriminate 
equally effectively (that is, they use the full range of marks 
where appropriate) and they conform (that is, their results 
tend to intercorrelate highly). They do tend to differ, 
however, with respect to the standards they adopt, rang- 
ing from severe to lenient. One suggested way of dealing 
with this problem is as follows. 

First the mean mark or grade for all moderators is cal- 
culated and the significance of any differences between an 
individual moderator’s mean and the general mean is 
worked out. The moderators can then be arranged in order 
of merit, so to speak, from the most severe to the most 
lenient. Suppose that the following result emerged from a 
preliminary trial involving twelve moderators: eight of 
them produce means that do not differ significantly from 
the group mean. This implies that these eight can be en- 
trusted forthwith to mark on behalf of the group. In other 
words it has been demonstrated that when any of these 
eight people mark papers they achieve a result reasonably 
near to that which would be obtained if all the moderators 
marked all the scripts and pooled their assessments. 

Furthermore it has been found possible to make use of 
those who fall outside the limits—that is, those whose 
means differ significantly from the group mean—by 
arranging for them to operate in pairs. If the severest
marker is teamed up with the most lenient, the next most
severe with the next most lenient, and so on, and if each
pair produces an averaged assessment it is not altogether
surprising that the results are found to fall within the
required range. Thus in the circumstances described, out of
a team of twelve moderators it would be possible to
organise ten acceptable marking units, so to speak: eight
individual markers and two pairs. The total number of
scripts to be dealt with could therefore safely be divided
into ten batches and assigned to these units for assessment.

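
The pairing arrangement can be sketched as follows. The sketch is
an illustration under assumptions: a fixed tolerance around the
group mean stands in for the significance test described in the
text, and the function name marking_units, the moderators' labels
and their mean marks are all invented.

```python
from statistics import mean


def marking_units(moderator_means, tolerance):
    """Form marking units from moderators' mean marks on a trial sample.

    Moderators within `tolerance` of the group mean mark alone. The rest
    are paired, severest with most lenient, and a pair is accepted only
    if its averaged mean falls within the tolerance.
    Returns (units, rejected).
    """
    group_mean = mean(moderator_means.values())
    units, outliers = [], []
    for name, m in moderator_means.items():
        if abs(m - group_mean) <= tolerance:
            units.append((name,))
        else:
            outliers.append(name)

    # Lowest mean mark first, i.e. from most severe to most lenient.
    outliers.sort(key=lambda name: moderator_means[name])
    rejected = []
    while len(outliers) >= 2:
        severe, lenient = outliers.pop(0), outliers.pop()
        pair_mean = (moderator_means[severe] + moderator_means[lenient]) / 2
        if abs(pair_mean - group_mean) <= tolerance:
            units.append((severe, lenient))
        else:
            rejected.extend([severe, lenient])
    rejected.extend(outliers)  # an odd moderator left without a partner
    return units, rejected


# Hypothetical trial with twelve moderators.
trial = {
    "A": 54, "B": 55, "C": 56, "D": 55, "E": 54.5, "F": 55.5,
    "G": 56, "H": 54, "I": 48, "J": 50, "K": 61, "L": 60,
}
units, rejected = marking_units(trial, tolerance=2)
```

With these invented figures eight moderators qualify alone and the
remaining four form two pairs, giving the ten marking units of the
example in the text.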
Moderation at regional and national levels—that is, 
seeking to ensure that marks or grades are comparable not 
only among a number of neighbouring schools but also 
over the country as a whole—inevitably requires proce- 
dures that are more complex than those we have just dis- 
cussed. Nevertheless the essence of the problem remains the 
same. The object of moderation, at every level, is to maxi- 
mise the validity of the final grades that are awarded. 


Some current experiments 


By way of a postscript to this chapter we should like to refer
briefly to some current experiments designed to examine 
the possibility of conducting school-based examinations 
from the results of which valid grades could be determined 
without the need for subsequent processes of moderation. 

The first step in the procedure that is at present being 
empirically tested (by the National Foundation for
Educational Research under the aegis of the Schools Council) is
for the teachers concerned with a particular examination 
to draw up a blue-print, along the lines indicated in Chapter 
3, of the kind of examination they require. They would 
then forward details of their plans, including the specific 
objectives to be measured and, possibly, a number of pro- 
posed items, to a central organisation, staffed by experts 
in test item construction and analysis. This organisation 
would arrange for the items suggested by the teachers and 
others designed by its own staff members to assess the 
specified objectives, to be administered to nationally repre- 
sentative samples of pupils. 

An item analysis of the results of this trial would thus 
furnish information which would establish the national 
currency, so to speak, of the items included. 

The teachers concerned would receive back from the 
central organisation a ‘bank’ of tested and approved items 
from which they could select a set that was tailor-made,
as it were, to suit the objectives they have stipulated.
Furthermore, when the test or examination has been given
it would be possible to determine, since each item has been
tried out on a nationally representative sample, the levels
of score appropriate to each C.S.E. grade.

The preliminary results of experiments along these lines 
suggest that this might well prove to be a viable procedure.
If so, we might reasonably look forward to enjoying the 
best of two worlds—the complete control by teachers 
within each school of the educational objectives that they 
wish to pursue, and examinations, specifically designed to 
assess these objectives, but nevertheless yielding assess- 
ments that are comparable over the country as a whole. 




9 


Conclusion 


The role of measurement in the total educational process 
cannot be precisely defined because the process itself is 
subject to continuous modification. We are constantly re- 
viewing the objectives that we should try to pursue and 
seeking improvements in the forms of organisation and 
methods of instruction that are designed to serve our selec- 
ted aims. There is inevitably, therefore, a recurrent need 
to examine afresh the part that educational measurement 
can usefully play in these changing circumstances. 

During the course of this monograph we have tried 
to indicate, in the light of discernible current trends, the 
kinds of responsibility that teachers might reasonably be 
expected to assume in this regard. By way of conclusion 
we propose to draw together some of these suggestions in
an attempt to set educational measurement in perspective. 

The following considerations indicate an apparent need 
for teachers in the future to become increasingly
conversant with the principles and techniques of educational
measurement.


School-based examinations in secondary schools 


We have referred to the growing tendency—a trend which 
is widely held to be desirable—for the functions of teach- 
ing and assessments to become more closely integrated. 

In the previous chapter we discussed the opportunities 
that are currently available to teachers in some schools
for controlling the examinations that lead to the award
of the Certificate of Secondary Education. The evidence so
far accumulated suggests that school-based examinations 
can be made viable and effective instruments for this pur- 
pose and, if this proves to be the case, teachers might well 
be invited to assume responsibility for other examinations 
that are at present externally administered. 


The eleven-plus examination 


A somewhat different situation—but one which may have 
comparable implications—has arisen concerning the ex- 
ternal examination that has been used to determine the 
allocation of primary school leavers to courses of secon- 
dary education. The ‘eleven-plus’ examination would seem 
to be destined to disappear from the scene as schemes for 
the re-organisation of secondary education come to be 
implemented. It is arguable, however, that this too may 
involve teachers in additional responsibilities for the assess- 
ment of their pupils’ abilities and aptitudes. The re- 
organisation of secondary schools along comprehensive 
lines is unlikely to involve identical forms of educational 
treatment for all the children concerned and it is reason- 
able to suppose that a good deal of the information that 
was formerly obtained by means of the procedures used 
by local education authorities will still be required for 
the purposes of educational guidance. 


Guidance and counselling


These two major processes of assessment, one at the end
of the primary and the other at the end of the secondary 
school course, clearly do not provide all the evidence that 
is needed for the effective guidance of pupils throughout 
the whole of their school careers. This was true in the past 
and is now more patently so. As schools, secondary schools 
in particular, become larger, more heterogeneous in com- 
position, and offer an increasingly more varied array of 
courses, close and continuous educational guidance will
be needed to ensure that curricula and methods are suited
as far as possible to each child’s emergent needs. It is
arguable, we would suggest, that this responsibility cannot be
satisfactorily discharged without the skilled use of objec- 
tive measuring devices. 


Changes in methods of teaching 


It is not only in order to assign pupils to appropriate 
courses in large secondary schools that a careful assessment 
of educational needs and capabilities will be required but 
also, it would seem, to determine within each classroom— 
at all levels—the specific kind of teaching that is appro- 
priate for each pupil. The current emphasis on the desir- 
ability of ‘individualistic instruction’ is firmly based on the 
recognition both that each individual pupil manifests a 
unique pattern of characteristics which calls for a bespoke 
educational treatment and that the means to satisfy this re- 
quirement are gradually becoming available. The growing 
provision of teaching machines, programmed texts and 
other aids makes it feasible to predict that the traditional 
forms of class teaching will be largely abandoned to make 
way for the establishment of adequately equipped ‘learn- 
ing laboratories’. 


Curriculum evaluation 


The prevailing climate of opinion is clearly favourable 
to an increased involvement on the part of teachers both 
in the process of curriculum planning and in the evaluation 
of the progress that their own pupils make towards attain- 
ing the objectives that such planning envisages. 

This may be seen as an aspect of the trend away from 
external examinations which involve, to some extent, the 
imposition upon a school of a curriculum that does not 
necessarily satisfy the requirements of the teachers and 
pupils concerned. 

What is at present being strongly advocated is that
teachers themselves should be encouraged to define the
objectives that they and their pupils ought to pursue and
that they should be free to design curricula and to choose
methods that they consider will best serve their purposes.

Such an extension of a teacher’s freedom of choice must
of course carry with it the responsibility for ensuring that
the effectiveness of the procedures they adopt is objectively
appraised.


Research 


Finally, we have pointed to the steadily growing volume 
of educational research with which teachers must
inevitably be associated. They will often be invited to
participate in the conduct of investigations which involve
educational measurements of various kinds and, whether
or not they take part in such research, they will clearly
be affected by the results that it yields. Without some
insight into the rationale of methods employed in
educational research and particularly into the kinds of
measurement that are involved they cannot adequately evaluate
the findings that emerge.


The trends that we have described indicate, we would 
suggest, that some teachers, at any rate, should become
familiar with the principles of educational measurement 
and practised in its techniques. It would of course be im- 
practicable to propose that every teacher could or should 
become an expert in this field. It would nevertheless seem 
to be desirable that some teachers in every area, perhaps 
one or two in every sizeable school, should become 
equipped to carry out measurement effectively and to
assist their colleagues with the design and interpretation of
such tests and examinations as they are required to
administer.

Furthermore, it would seem to be desirable that every 
teacher should be sufficiently acquainted with the prin- 
ciples of measurement to enable him to appreciate what it 
is that his more specially qualified colleagues are
endeavouring to accomplish. By way of analogy it might be
reasonably argued that it is unnecessary for every medical
practitioner to be equipped with and trained in the use of
all the diagnostic instruments that are used in modern 
hospitals. Nevertheless we should think rather poorly of 
our family doctor if he were not aware of the potential 
value of these measuring devices, if he failed to recognise 
the circumstances in which they can be profitably 
employed, and if he were totally incapable of interpreting 
and acting upon the information that they yielded. 

We should also like to re-emphasise a point that we made 
earlier. It must be acknowledged that, for a variety of 
purposes, teachers can legitimately rely on their day-to-day 
observations and subjective impressions. It is conceivable, 
for example, that satisfactory educational guidance can 
be afforded within a school by these means. If, however, 
it becomes necessary to report assessments—from one 
teacher to another or from one school to another—some 
more objective forms of measurement would seem to be 
required. 


Some necessary precautions 


We should be guilty of overstating the case we have been 
trying to represent if we failed to introduce a note of 
caution about the possible misuse of measuring instru- 
ments. 

First, it must be recognised that however carefully 
educational tests are devised or selected an element of error 
is invariably associated with the results that they yield. 
Thus, marks or scores can never be accepted at their face 
value. They provide only an estimate of the qualities or 
characteristics under review. This is recognisable enough 
when we encounter assessments expressed in words or, 
perhaps, in literal grades. It is possible, however, when we 
are dealing with numerical scores to succumb to the temp- 
tation of exaggerating their degree of accuracy and pre- 
cision. 

Furthermore, it is rarely the case that any important 
educational decision can be based solely on the indications
provided by tests and examinations. It is usually necessary
to take into account a variety of factors, some of them
associated perhaps with the child’s unique background 
or with the particular circumstances within which the 
assessments are being carried out. In other words, educa- 
tional measurement does not serve as a substitute for a 
teacher’s judgment. It should be regarded rather as pro- 
viding additional useful evidence which might help to 
make that judgment more soundly based. 

Finally, we would point to the risk that a teacher might 
become so excessively preoccupied with the techniques of 
measurement and impressed by the results that emerge that 
he fails to recognise that one of his prime responsibilities 
is, in a sense, to seek to falsify some of the predictions that 
are based on test scores. For example, a poor performance 
in the eleven-plus examination indicates that a child is 
unlikely to reach a high academic standard in his second- 
ary school course. The proper attitude to adopt towards 
this, and other comparable forecasts, is that they represent 
not a final verdict but a conjecture which a teacher may 
choose to accept or reject. In some circumstances he may
regard such a prediction as a provocative challenge and, 
if he succeeds in demonstrating its invalidity, even the 
ranks of psychometry will scarce forbear to cheer. 




Suggestions for further reading 


Chapter 1

CHAUNCEY, H. and DOBBIN, J. E. (1963) Testing: Its Place
in Education Today, New York: Harper & Row

EGGLESTON, J. F. (1965) A Critical Review of Assessment
Procedures in Secondary School Science, Leicester:
School of Education

WISEMAN, S., ed. (1961) Examinations and English
Education, Manchester University Press


Chapter 2

LORGE, I. (1950) ‘The fundamental nature of measurement’,
Chapter 14 in LINDQUIST, E. F., ed. Educational
Measurement, Washington: American Council on Education

REMMERS, H. H. and GAGE, N. L. (1955) Educational
Measurement and Evaluation, New York: Harper & Row

STEVENS, S. S. (1951) ‘Mathematics, Measurement and
Psychophysics’, in STEVENS, S. S., ed. Handbook of
Experimental Psychology, New York: John Wiley

THORNDIKE, R. L. and HAGEN, E. (1955) Measurement and
Evaluation in Psychology and Education, New York:
John Wiley

TYLER, L. E. (1963) Tests and Measurements, New York:
Prentice Hall


Chapter 3

BLOOM, B. S. (1956) Taxonomy of Educational Objectives,
London: Longmans Green

EBEL, R. L. (1965) Measuring Educational Achievement,
New York: Prentice Hall

FURST, E. J. (1957) Constructing Evaluation Instruments,
London: Longmans Green

LINDQUIST, E. F., ed. (1950) op. cit.

THORNDIKE, R. L. and HAGEN, E. (1955) op. cit.


Chapter 4

DAVIS, F. B. (1964) Educational Measurements and their
Interpretation, California: Wadsworth Publishing Co.

GARRETT, H. E. (1964) Statistics in Psychology and
Education, London: Longmans Green

GUILFORD, J. P. (1956) Fundamental Statistics in
Psychology and Education, New York: McGraw Hill

LINDQUIST, E. F. (1942) First Course in Statistics, New
York: Houghton Mifflin

THORNDIKE, R. L. and HAGEN, E. (1955) op. cit., Chapter 5

TYLER, L. E. (1963) op. cit., Chapter 2

VERNON, P. E. (1940) The Measurement of Abilities,
University of London Press


Chapter 5

CRONBACH, L. J. (1960) Essentials of Psychological Testing,
Chapters 5 and 6, New York: Harper & Row

CURETON, E. E. (1950) ‘Validity’, Chapter 16 in LINDQUIST,
E. F., op. cit.

DAVIS, F. B. (1950) ‘Item Selection Techniques’, Chapter 9
in LINDQUIST, E. F., op. cit.

GUILFORD, J. P. (1954) Psychometric Methods, Chapters
14 and 15, New York: McGraw Hill

NUNNALLY, J. C. (1967) Psychometric Theory, Chapters 3
and 7, New York: McGraw Hill

THORNDIKE, R. L. (1950) ‘Reliability’, Chapter 15 in
LINDQUIST, E. F., op. cit.

THORNDIKE, R. L. and HAGEN, E. (1955) op. cit., Chapter 7


Chapter 6

FLANAGAN, J. C. (1950) ‘Units, Scores and Norms’, Chapter
17 in LINDQUIST, E. F., op. cit.

GUILFORD, J. P. (1956) op. cit.

THORNDIKE, R. L. and HAGEN, E. (1955) op. cit., Chapter 6


Chapter 7

VERNON, P. E. (1960) Intelligence and Attainment Tests,
University of London Press

VERNON, P. E. (1961) The Structure of Human Abilities,
University of London Press

THOMSON, G. (1951) The Factorial Analysis of Human
Ability, University of London Press


Chapter 8

H.M.S.O. (1965) Examinations Bulletin No. 5: Schools
Council

YATES, A. and PIDGEON, D. A. (1957) Admission to Grammar
Schools, London: Newnes

WOOD, R. (1968) ‘The Place and Value of Item Banking’,
Educational Research, Vol. 2, No. 2




The Students Library of 
Education 
General Editor: J. W. Tibble 
Emeritus Professor of Education, 
University of Leicester 


Editorial Board: Ben Morris, 
Richard Peters, Brian Simon and 
William Taylor 


This series has been designed to meet 
the needs of students of Education at 
Colleges of Education and at University 
Institutes and Departments. It will also 
be valuable for practising teachers and 
educationists. The series takes full 
account of the latest developments in 
teacher-training and of new methods 
and approaches in education. Separate 
volumes will provide authoritative and 
up-to-date accounts of the topics 
within the major fields of sociology, 
philosophy and history of education, 
educational psychology, and method.
Care has been taken that specialist
topics are treated lucidly and usefully
for the non-specialist reader.
Altogether, the Students’ Library of
Education will provide a comprehensive
introduction and guide to anyone
concerned with the study of education,
and with educational theory and
practice.


SBN 7100 6247 8 


Printed in Great Britain


The Students Library of Education 


*These titles are available in library editions only; all other
titles are available in two editions, library and paperback


Volumes already published

*THE STUDY OF EDUCATION

METHOD ETC.
CHANGING AIMS IN RELIGIOUS EDUCATION              Edwin Cox
SPELLING: CAUGHT OR TAUGHT?                       M. L. Peters

HISTORY
THE AMERICAN INFLUENCE ON ENGLISH EDUCATION       W. H. G. Armytage
THE FRENCH INFLUENCE ON ENGLISH EDUCATION         W. H. G. Armytage
THE GERMAN INFLUENCE ON ENGLISH EDUCATION         W. H. G. Armytage
SOCIAL CHANGE AND THE SCHOOLS: 1918-1944          Gerald Bernbaum
THE FOUNDATIONS OF TWENTIETH-CENTURY EDUCATION    E. Eaglesham
*MEDIAEVAL EDUCATION AND THE REFORMATION          J. Lawson
THE EVOLUTION OF THE COMPREHENSIVE SCHOOL         David Rubinstein and Brian Simon
RECENT EDUCATION FROM LOCAL SOURCES               Malcolm Seaborne
CULTURE, INDUSTRIALISATION AND EDUCATION          G. H. Bantock

PHILOSOPHY
*THE PHILOSOPHY OF PRIMARY EDUCATION              R. F. Dearden
EDUCATION AND THE CONCEPT OF MENTAL HEALTH        John Wilson
PERSPECTIVES ON PLOWDEN                           R. S. Peters

PSYCHOLOGY
GROUP STUDY FOR TEACHERS                          Elizabeth Richardson
STUDENTS INTO TEACHERS                            Mildred Collins
AN OUTLINE OF PIAGET'S DEVELOPMENTAL PSYCHOLOGY   Ruth M. Beard

SOCIOLOGY
THE SOCIAL CONTEXT OF THE SCHOOL                  S. John Eggleston

